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study  a  continuous  idealisation  of  the  PET  reconstruction  problem, 
considered  as  an  example  of  bivariate  density  estimation  based  on  indirect 
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indirect  and  for  direct  observations,  -we  estabKsh;>the  exact  minimax  rates  ^ 
convergence  of  estimation,  for  all  possible  estimators,  over  suitable 
smoothness  classes  of  functions.  For  indirect  data  and  (in  practice 
unobservable)  direct  data,  these  rates  are  (n/logn)"^/^/’’*''^ 
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1  Introduction 

Tomography  is  a  non-invasive  technique  for  reconstructing  the  internal  structure 
of  an  object  of  interest,  often  in  a  medical  context.  Positron  emission  tomography 
(PET)  deals  with  the  estimation  of  the  amount  and  location  of  a  radioactively  labeled 
metabolite  on  the  basis  of  particle  decays  indirectly  observed  outside  the  body. 
Emission  tomography  in  general,  and  PET  in  particular,  has  been  the  subject  of 
considerable  recent  research  in  nuclear  medicine, .  and  has  attracted  the  interest  of 
statisticians  as  an  example  of  a  reconstruction  problem  involving  incomplete  and  noisy 
data.  Some  references  are  given  in  Sections  2  and  3  below. 

In  a  typical  tomography  scan,  a  large  number,  perhaps  ten  million,  radioactive 
emissions  are  recorded,  and  the  image  of  interest,  a  slice  through  the  patient’s  brain  or 
body,  is  reconstructed  in  some  way  from  this  apparently  vast  data  set.  But  is  ten 
million  observations  really  a  large  sample  in  this  kind  of  context?  As  will  be 
explained  below,  the  observations  are  not  direct  observations  in  the  usual  statistical 
sense,  because  the  original  positions  of  the  detected  emissions  in  the  brain  are  not 
known.  All  that  the  equipment  is  able  to  detect  is  a  tube  in  space  within  which  each 
particular  emission  took  place,  and  thus  the  observations  can  be  thought  of  as  indirect 
observations  of  the  phenomenon  of  interest.  This  incompleteness,  or  indirectness,  of 
the  observed  data  is  of  course  a  standard  feature  of  linear  inverse  problems  in  general, 
and  PET  is  just  one  concrete  example  of  such  problems. 

One  way  of  gaining  some  insight  into  the  problem  is  to  think  in  terms  of 
equivalent  sample  sizes.  We  make  some  smoothness  assumptions  about  the  image  of 
interest,  and  then  ask  how  accurately  it  could  possibly  be  reconstructed  given  a 
particular  indirect  sample.  The  equivalent  sample  size  would  be  the  number  of 
emissions  whose  original  positions  could  yield  an  equally  accurate  estimate.  The 
equivalent  sample  size  gives,  in  terms  more  attuned  to  usual  statistical  intuition,  a 
quantification  of  the  amount  of  information  actually  available  from  our  sample  of  ten 
million  indirectly  observed  emissions,  and  hence  gives  an  idea  of  how  much  is  lost  by 
the  indirect  nature  of  the  observation  process. 

This  paper  has  several  objectives.  We  provide  a  continuous  formulation  of  a 
simplified  model  of  the  PET  reconstruction  problem.  This  allows  us  to  formulate  the 
reconstruction  problem  as  an  example  of  nonparamctric  bivariate  density  estimation 
based  on  indirect  data,  in  fact  an  example  of  a  linear  inverse  problem  in  a  function 
space.  This  continuous  framework  is  flexible  enough  to  include  the  estimators 
proposed  in  the  literature  for  practical  use,  and  reviewed  in  Section  3  below.  These 
estimators  are  mostly  based  on  a  discretisation  of  the  problem;  although  the  continuous 

1  d;,' 


'  •- 


3’ 


■t 

□ 

□ 


'ty  Codes 


I  or 


I.  M.  JOHNSTONE  &  B.  W.  SILVERMAN 


POSITRON  EMISSION  TOMOGRAPHY 


formulation  might  be  used  as  the  basis  for  practical  reconstructions,  our  orientation  in 
the  present  paper  is  mainly  theoretical.  We  do  not  consider  explicitly  the  iterative 
non-linear  algorithms  proposed  for  practical  use,  but  instead  we  establish  a  lower 
bound  (over  a  suitable  smoothness  class  of  functions)  to  the  rate  of  convergence  of  all 
estimators.  We  then  show  that  this  rate  bound  is  sharp  by  computing  the  rate  of 
convergence  of  a  particularly  simple  linear  density  estimate  based  on  orthogonal  series. 
For  these  estimators,  and  for  a  surrogate  to  the  mean  integrated  square  error,  we  give 
some  numerical  values  for  equivalent  sample  sizes,  and  these  are  discussed  in  Section 
4  below. 

Concluding  sections  of  the  paper  test  the  sensitivity  of  our  results  to  perturbations 
in  our  mathematical  idealisation.  We  consider  the  effect  of  allowing  for  the  three 
dimensional  nature  of  the  problem  and  for  error  measures  incorporating  local  variation 
and  find  that  our  qualitative  conclusions  are  not  significantly  affected. 

We  have  tried  to  present  our  results  in  such  a  way  that  readers  prepared  to  take 
the  more  teclinical  details  on  trust  need  read  the  first  four  Sections  only. 

A  subsidiary  objective  of  the  paper  is  to  illustrate,  in  a  relatively  simple  and 
concrete  setting,  the  general  approach  to  deriving  lower  bounds  to  estimation  risk 
developed  by  Le  Cam,  Ibragimov  and  Hasminskii,  and  Birg^.  This  method  relates  the 
best  possible  speed  of  estimation  (in  a  given  "global”  metric)  to  the  metric  entropy 
structure  of  the  parameter  space.  We  need  a  minor  modification  to  handle  the  present 
indirect  estimation  setting,  introducing  a  form  of  "modulus  of  continuity"  of  the 
inverse  transform. 

A  basic  tool  throughout  the  paper  is  the  singular  value  decomposition  (SVD)  of 
the  Radon  transform  linking,  in  a  sense  that  will  be  discussed  below,  the  unknown 
density  with  the  observed  data.  Even  when  we  do  not  know  the  SVD,  as  in  one 
modification  of  the  PET  problem  discussed  in  Section  10,  the  lower  and  upper  bounds 
differ  by  only  logarithmic  terms. 

Although  we  unashamedly  concentrate  on  positron  emission  tomography,  our 
methods  are  relevant  to  other  indirect  estimation  problems  as  well,  especially  when  the 
singular  value  decomposition  is  available,  as  is  discussed  briefly  in  the  concluding 
Section  11. 


'  ‘  I.  M.  JOHNSTONE  &  B.  W.  SILVERMAN 


POSITRON  EMISSION  TOMOGRAPHY 


2.  The  positron  emission  tomography  problem 
2.1  Introduction 

We  study  the  positron  emission  tomography  (or  PET)  problem  in  this  paper  for 
two  reasons.  Firstly,  it  is  of  important  practical  interest  in  its  own  right,  and  secondly 
it  is  a  typical  example  of  a  problem  involving  indirect  observation  of  the  probability 
distribution  of  actual  interest.  Other  such  problems  arise  in  tomography  of  different 
kinds,  as  well  as  in  many  other  fields,  and  a  detailed  study  of  one  particular  indirect 
observation  problem  will  yield  insight  into  the  more  general  cases. 

The  formulation  of  the  PET  problem  we  shall  consider  is  basically  that  given  by 
Shepp  and  Vardi  (1982)  and  Vardi,  Shepp  and  Kaufman  (1985).  Following  their  con¬ 
vention  we  shall  consider  a  particular  PET  experiment,  where  the  brain  is  scanned  by 
counting  radioactive  emissions  from  tagged  glucose.  The  distribution  of  glucose 
within  the  brain  corresponds  to  the  glucose  uptake  mechanism,  and  so  a  map  of  the 
glucose  distribudon  within  the  brain  gives  an  indication  of  the  pattern  of  the  brain’s 
metabolic  activity. 

The  radioactive  tagging  of  the  glucose  gives  rise  to  emissions  of  positrons  distri¬ 
buted  as  a  Poisson  process  in  space  and  time;  the  spatial  intensity  of  emissions  is  the 
same  as  the  distribution  of  glucose.  Each  positron  that  is  emitted  annihilates  with  a 
nearby  electron,  and  yields  two  photons  that  fly  off  in  opposite  directions  along  a  line 
with  unifomily  distributed  orientation.  One  or  more  rings  of  sensors  placed  around  the 
patient’s  head  make  it  possible  to  detect  the  photon  pairs  and  hence,  for  each  emission 
that  is  detected,  to  give  a  line  on  which  the  point  of  emission  must  have  lain.  How¬ 
ever  (for  equipment  of  the  kind  being  considered  at  the  moment)  it  is  not  possible  to 
detect  the  position  of  the  emission  on  the  line.  In  practice  the  sensors  are  of  finite  size 
and  so  what  is  actually  observed  is  the  numbers  of  emissions  over  a  period  of  time  fal¬ 
ling  in  various  detector  tubes,  each  tube  made  up  of  the  cylindrical  volume  between  a 
pair  of  detectors.  There  are  various  simplifying  assumptions  made  in  the  description 
we  have  given.  For  example,  in  reality  the  positrons  may  well  travel  a  finite  distance 
before  annihilation  and,  furthermore,  once  the  annihilation  has  taken  place  the  two 
photons  may  fly  off  at  an  angle  of  slightly  less  than  180®  to  one  another.  For  further 
discussion  of  the  problem  see  the  papers  mentioned  above,  and  also  Section  1.3  of 
Deans  (1983). 
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2.2  An  idealised  problem  and  the  Radon  transform 

The  problem  we  shall  consider  in  detail  is  an  idealised  version  of  the  PET  prob¬ 
lem.  The  ring  of  detectors  defines  a  slice  of  the  patient’s  head,  and  the  reconstruction 
aims  to  display  a  picture  of  the  glucose  density  within  that  slice.  Emissions  that  give 
rise  to  photon  pairs,  one  or  both  of  which  miss  the  detector  ring,  will  go  unrecorded. 
Bearing  this  in  mind,  we  shall  regard  the  slice  as  a  plane  and  consider  an  essentially 
two-dimensional  problem  where  (see  Fig.  2.1)  enussions  take  place  in  the  plane 
according  to  some  density  within  a  detector  circle  taken  to  be  the  unit  circle  in  the 
plane.  An  emission  at  P  gives  rise  to  a  photon  pair  whose  directions  of  flight  lie  in 
the  plane  along  a  line  /  through  P  with  random,  uniformly  distributed,  orientation.  The 
finite  size  of  the  detectors  is  ignored  and  it  is  assumed  that  the  points  B  and  C  of  the 
intersection  of  /  with  the  detector  circle  are  observed  exactly. 


Figure  2,1:  The  patient  and  the  detector  circle 

Give  the  name  detector  space  to  the  space  of  all  possible  unordered  pairs  BC  of 
points  on  the  detector  circle,  and  call  brain  space  the  original  disc  in  the  plane 
enclosed  by  the  detector  ring.  Brain  space  is  parametrised  either  by  cartesian  or  stan¬ 
dard  polar  coordinates;  elements  of  detector  space  are  parametrised  as  {s,<p)  as  shown 
in  Figure  2.2.  The  parameter  s,  OSj^I,  gives  the  length  of  the  perpendicular  from  the 
origin  to  the  detected  line  BC,  while  <p,  0^<p^2n,  gives  the  orientation  of  this  perpen¬ 
dicular. 
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Figure  2.2:  Parametrising  the  line  I 

We  now  define  dominating  measures  on  brain  space  and  on  detector  space. 
Define  a  measure  n  on  brain  space  to  be  lebesgue  measure,  so  that 

d^(r,e)  =  7i~^ rdrdd  for  and  O^0<2;r  if  polar  coordinates  are  used,  and 

d^(Xi,X2)  =  ft~^dxidx2  for  in  Cartesian  coordinates.  The  factor  )r~^  is 

included  to  make  n  integrate  to  1.  On  detector  space,  define  a  measure  A  by 
dX(s,q>)  =  2tr~^(\-s^)^dsdp.  This  measure  also  integrates  to  1  ;  the  reason  for  the 
form  of  X  will  be  explained  in  a  moment. 

Suppose  an  emission  takes  place  at  a  point  distributed  with  probability  density 
f{xi  ,X2)  with  respect  to  /x  in  brain  space.  Let  g  =  F/  be  the  probability  density  in 
detector  space,  with  respect  to  A,  of  the  corresponding  detection  of  a  pair  of  photons, 
so  that  the  mapping  P  maps  the  actual  density  of  emissions  to  the  corresponding 
observable  density  in  detector  space.  We  shall  show  below  that  Pf  is  given  by 

Pf{s,q>)  =  cos  <p  -  t  sin  j  sin  ?>  +  r  cos  <p)dt  (2.1) 

The  integral  in  (2.1)  is  the  so-called  Radon  transform  of  the  density  /,  namely  the  line 
integral  of  /  along  the  line  /  with  co-ordinates  {s,(p)  in  detector  space.  The  Radon 
transform  has  been  the  subject  of  much  study;  see,  for  example,  Marr  (1974)  and 
Deans  (1983).  Since  the  length  of  the  segment  BC  is  2(l-j^)i,  it  can  be  seen  at  once 
that  Pfis.tp)  is  the  average  of  /  over  the  part  of  /  that  intersects  the  detector  disc 
||;c||51.  If  /  is  the  uniform  density  in  brain  space,  so  that  /(xj  .Xj)  =  1  for  all  ||x||:Sl, 
then  we  will  have  Pfis,<p)  =  1  for  all  s  and  <p.  Thus  the  probability  measure  A  in 
detector  space  is  the  detector  space  distribution  corresponding  to  the  uniform  measure 
IX  in  brain  space. 
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Figure  2.3:  Rotating  the  coordinates 

It  remains  to  demonstrate  the  correctness  of  (2.1).  Suppose  an  emission  takes 
place  at  (Xi  ,X2)  and  that  the  corresponding  photon  pair  has  trajectory  at  angle  'F  as 
shown  in  Figure  2.3;  we  take  for  definiteness,  and  the  joint  probability  density 

with  respect  to  dxidx2dy/  on  |U1|<1  and  is  given  by 

fxuX^,'V  >X2,y)  =  7t-^nxi,X2) 


using  the  definidon  of  and  the  fact  that  'F  is  independent  of  Xj  and  X2.  Now 
change  variables  by  rotating  the  axes  in  brain  space  by  an  angle  y,  so  that 

Xi  =  ti  cos  y  -  t2  sin  y  and  X2  =  sin  y  +  t2  cos  y. 

The  Jacobian  of  this  transformation  is  1,  and  so  with  respect  to  dt^dt2dy  we  have 

hiJiy  (^i  ^h^¥)  =  fih  cos  y  -  t2  sin  y,t]^  sin  y  +  t2  cos  y) 

on  llri^l  and  O^y^n  . 

Now  integrate  t2  out,  over  the  relevant  range,  to  give 


_9 


f{ti  cos  y  -  t2  sin  y.t^  sin  y  +  t2  cos  y)dt2 


To  give  the  coordinates  in  detector  space  of  the  detected  photon  pair  set 


is,y) 


(ti,y)  if  fi^O 

if  ^i<0  . 


This  transformation  again  has  unit  Jacobian,  and  has  the  property  that 
s  cosy  =  cos  y  and  s  sin  ^  =  tj  sin  y,  so  that  the  joint  density  with  respect  to  dsdy 
is 
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^  ^  cos<p  -  t  sin  <p,s  sin  ^  +  r  cos  <p)dt. 

The  density  (2.1)  with  respect  to  X  follows  at  once  from  the  definition  of  X. 


3.  Outline  of  our  strategy 

Statisticians  are  accustomed  to  dealing  with  samples  of  various  sizes  and  have  a 
certain  kind  of  (sometimes  misguided)  intuition  about  what  constitutes,  say,  a  "large" 
data  set,  and  what  kind  of  information  can  be  extracted  from  such  a  data  set.  Tukcy 
and  Tukey  (1981,  p.  194)  even  attempt  to  formalise  this  intuition  by  suggesting  the  use 
of  letters  of  the  alphabet  A-G  as  a  scale  of  "numerosity"  of  the  data  set.  Vardi,  Shcpp 
and  Kaufman  (1985),  in  their  paper  describing  the  PET  problem,  discuss  the  use  of 
data  sets  of  10^  observed  emissions,  but  because  of  the  indirect  nature  of  the  observa¬ 
tion  process  it  is  not  immediately  clear  how  much  information  such  a  data  set  would 
convey.  In  this  paper  we  shall  attempt  to  gain  some  feeling  for  this  question  by  com¬ 
paring  the  accuracy  with  which  an  emission  density  can  be  estimated  from  indirect 
observations  with  the  accuracy  that  could  be  obtained  if  the  emissions  were  observed 
in  their  original  positions  in  the  brain. 

We  shall  make  this  comparison  in  two  related  ways.  The  first  is  the  equivalent 
sample  size  approach  described  in  the  introduction.  An  alternative,  somewhat  cruder 
way  of  evaluating  the  efficiency  of  statistical  techniques  is  to  consider  the  doubling 
ratio,  defined  to  be  the  relative  increase  r  =  Hln  ln  the  number  of  observations  needed 
to  halve  the  root  mean  square  error  of  estimation.  In  the  case  of  the  estimation  of 
functions,  a  doubling  ratio  can  of  course  be  calculated  with  reference  to  the  root  mean 
integrated  square  error.  For  standard  estimation  procedures  involving  a  finite  number 
of  parameters  the  root  mean  square  error  is  exactly  0(n“^)  and  so  the  doubling  ratio  is 
equal  to  4,  at  least  asymptotically.  We  shall  calculate  and  compare  the  doubling 
ratios,  for  a  particular  estimation  method,  for  the  direct  and  indirect  cases  in  the  PET 
problem,  and  for  suitable  restrictions  on  the  class  of  true  densities  /.  Of  course,  the 
estimation  accuracy  achieved,  and  hence  the  doubling  ratio,  depend  on  the  estimation 
method  being  used.  We  shall  concentrate  most  attention  on  a  linear  estimation  pro¬ 
cedure,  based  on  a  class  of  orthogonal  series  density  estimators.  The  orthogonal  series 
we  shall  use  arises  from  the  singular  value  decomposition  of  the  Radon  transform. 

There  is  a  substantial  literature  on  practical  algorithms  for  reconstruction  in  the 
PET  setting.  An  extensive  survey  covering  the  period  up  to  1979  is  given  by  Budinger, 
Gullberg  and  Heusman  (1979);  this  includes  adaptation  of  methods  from  X-ray 
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transmission  tomography  and  the  orthogonal  series  method  of  Marr  (1974).  Maximum 
likelihood  methods  were  proposed  by  Rockmore  and  Macovski  (1977);  they  were 
implemented  via  the  EM  algorithm  by  Shepp  and  Vardi  (1982)  (see  also  Vardi,  Shepp 
and  Kaufman,  1985)  and  modified  in  various  ways  to  incorporate  smoothing  by 
Geman  and  McClure  (1985)  and  Jones,  Silverman  and  Wilson  (1988).  A  recent  survey 
of  algorithms  is  given  by  Tanaka  (1987). 

As  a  broad  generalisation,  one  might  say  that  many  of  the  algorithms,  especially 
those  based  on  maximum  or  penalised  likelihood,  share  some  common  features.  They 
are  typically  non-linear  and  iterative,  and  respect  the  fact  that  the  emission  density  is 
necessarily  non-negative.  For  these  reasons  such  algorithms  are  difficult  to  study 
analytically  with  regard  to  issues  such  as  quality  of  approximation. 

Our  strategy  is  to  seek  crude  qualitative  information  about  the  nature  of  the 
reconstruction  problem,  exploiting  the  continuous  formulation  given  in  Section  2.2. 
The  continuous  formulation  has  several  virtues: 

1)  it  gives  a  (simplified)  model  of  the  actual  physics  of  the  problem; 

2)  no  grouping  or  discretisation  of  the  data  is  involved,  so  we  can  use  it  for  lower 

bounds  on  all  estimation  procedures,  in  whatever  way  the  data  are  actually  discre- 

tised  in  practice; 

3)  the  singular  value  decomposition  of  the  Radon  transform  can  be  applied  to 

examine  what  kinds  of  features  are  well  reconstructed; 

4)  it  mcy  provide  a  framework  for  construction  of  alternative  algorithms. 

We  shall  study  the  rates  of  convergence  associated  with  estimation  methods  as 
the  sample  size  increases.  We  shall  show  that  there  is  a  lower  bound  to  the  rate  of 
convergence  under  a  smoothness  assumption  on  the  class  of  unknown  emission  densi¬ 
ties.  This  lower  bound  will  apply  to  all  estimators,  linear  or  non-linear,  iterative  or  not. 
To  establish  that  this  lower  bound  is  sharp,  we  evaluate  expliritly  the  rate  of  conver¬ 
gence  of  our  specific  estimator  sequence  based  on  orthogonal  series.  Though  this  esti¬ 
mator  is  linear  in  the  (empirical  Fourier  coefficients  of  the)  data,  and  non-iterative,  it 
turns  out  to  achieve  the  same  rate  of  convergence  as  optimal  non-linear  estimators. 
An  obvious  question,  which  we  intend  to  investigate  in  future  work,  is  whether  our 
linear  estimator  is  at  all  useful  as  a  practical  method  for  finite  samples. 

The  remainder  of  this  paper  is  set  out  as  follows.  For  the  benefit  of  readers 
prepared  to  take  technical  details  on  trust.  Section  4  summarises  our  main  conclusions 
in  terms  of  equivalent  sample  sizes  and  doubling  ratios.  In  Section  5  the  singular 
value  decomposition  of  the  Radon  transform  is  reviewed,  and  details  are  given  of  the 
smoothness  conditions  that  we  shall  place  on  the  unknown  density  /.  Roughly 
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speaking,  a  density  /  will  be  in  smoothness  class  if  its  pth  derivatives  have 
square  integral  bounded  by  C^,  the  integral  being  with  respect  to  a  particular  dominat¬ 
ing  measure.  The  higher  p  is,  the  more  smoothness  is  imposed  on  the  functions  in 
c-  Ouf  linear  estimation  methods  are  introduced  in  Section  6,  and  in  Sections  7 
and  8  we  derive  the  best  possible  rate  of  convergence  that  can  be  achieved  by  these 
linear  estimators  over  any  particular  class  Bp  c-  Section  9  contains  the  main  part  of 
the  argument  that  shows  that  essentially  nothing  is  lost,  at  least  in  terms  of  rates,  by 
considering  only  a  special  class  of  linear  estimators,  since  the  rates  they  achieve  can¬ 
not  be  bettered  by  any  other  estimators,  linear  or  non-linear.  It  has  already  been  noted 
in  Section  2  that  we  are  considering  a  highly  simplified  physical  model  for  positron 
emission  tomography,  and  in  Section  10  we  shall  consider  the  ramifications  of  making 
slightly  more  realistic  assumptions.  Firstly  we  consider  error  measures  that  incorporate 
derivative  information,  and  secondly  we  explore  the  effect  of  the  length-biased  sam¬ 
pling  introduced  by  our  choice  of  a  two-dimensional  idealisation  of  what  is  really  a 
three-dimensional  problem.  Section  1 1  briefly  indicates  other  classes  of  indirect  esti¬ 
mation  problems  to  which  the  methods  of  this  paper  apply.  There  is,  finally,  an 
appendix  containing  technical  details  and  proofs  of  some  of  the  results  presented  in  the 
main  part  of  the  paper. 

We  close  this  section  with  some  remarks  about  discretisation.  Most,  if  not  all, 
current  estimation  methods  in  PET  depend  on  various  discretisations.  For  example, 
there  is  usually  a  finite  number  (say  m=128)  of  detector  elements  in  a  ring  about  the 
patient,  and  hence  a  finite  number  \m{m-\)  of  detector  tubes.  Only  a  ‘binned’  ver¬ 
sion  P*/  of  the  emission  density  Pf  is  observed,  and  the  mapping  f—*P^f  will  have  a 
null  space  about  which  nothing  can  be  said  (even  asymptotically  in  n)  without  employ¬ 
ing  further  information  about  /.  We  do  not  attempt  to  discuss  dicretisation  effects  in 
this  paper,  since  our  focus  is  on  the  continuous  idealisation  of  the  PET  problem  and  its 
attendant  simplicity.  Thus  our  results  can  be  construed  either  as  theoretical  limits  that 
apply  regardless  of  discretisation,  or  as  being  applicable  should  "position-sensitive 
detectors"  become  available  for  data  collection.  Our  results  should  be  realistic  for 
combinations  of  sample  size  and  discretisation  level  for  which,  over  the  appropriate 
smoothness  class,  the  maximum  bias  due  to  discretisation  is  small  relative  to  the  total 
minimax  mean  square  error  for  that  sample  size.  TTiis  maximum  bias  can  be  calcu¬ 
lated,  at  least  in  principle,  once  a  particular  detector  geometry  and  division  of  the 
image  space  into  pixels  have  been  specified.  All  this  assumes,  of  course,  that  the  sam¬ 
ple  size  is  large  enough  for  the  asymptotics  to  be  useful! 
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4  r  .<tin  conclusions  of  the  paper 
4.1  Equivalent  sample  sizes 

Our  main  conclusions  are  in  terms  of  the  minimax  rates  of  convergence  of  the 
estimation  of  /  based  on  n  observations.  The  maximum  is  taken  over  /  in  the  smooth¬ 
ness  class  Bp  and  the  minimum  is  over  all  possible  estimators  based  on  a  data  set  of 
size  n. 

Suppose  the  observations  are  taken  directly,  in  other  words  that  the  exact  position 
of  each  emission  in  brain  space  is  known.  Then  we  shall  show  that,  for  fixed  p  and  C, 
the  minimax  mean  integrated  square  error  in  the  estimation  of  /  is  exacdy  of  order 
where 

roin)  =  CD„(lognJn)P^^P-*-^\  (4.1) 

and  C£,  „  is  a  sequence  of  unspecified  positive  constants  bounded  away  from  zero  and 
infinity.  On  the  other  hand,  for  an  indirect  sample  of  size,  n,  where  only  the  emissions 
as  actually  recorded  in  detector  space  are  observed,  the  minimax  mean  integrated 
square  error  is  exactly  of  order  ry(«),  where 

r,{n)  =  (4  2) 

and  Cf  „  is  again  a  sequence  of  unspecified  positive  constants  bounded  away  from  zero 
and  infinity. 

It  is  now  straightforward  to  draw  some  qualitative  conclusions  about  equivalent 
sample  sizes.  Suppose  we  have  an  indirect  sample  of  n  detected  emissions.  Let  m(n) 
be  the  equivalent  direct  sample  size,  that  is  the  number  of  emissions  whose  original 
positions  in  the  brain  would  allow  us  to  estimate  /  with  the  same  accuracy.  Some 
simple  algebra  from  (4.1)  and  (4.2)  yields  the  order  of  magnitude  of  the  equivalent 
sample  size  as 

^D,n 

Perhaps  not  surprisingly,  the  equivalent  sample  size  depends  in  order  of  magni¬ 
tude  on  the  smoothness  assumptions  made  on  the  density  /.  The  smoother  /  is 
assumed  to  be,  the  larger  will  be  the  index  p.  Hence  for  very  smooth  densities  the 
power  in  (4.3)  will  be  close  to  1  and  little  will  be  lost  as  a  result  of  the  indirect  nature 
of  the  observation  process.  However,  in  reality,  we  ought  not  to  assume  that  the  true 
emission  density  necessarily  varies  very  smoothly,  since  tissue  boundaries  and/or  local¬ 
ised  areas  of  high  metabolic  activity  may  lead  to  discontinuities,  certainly  in  high 


-  £±1 

^  (4  3) 
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derivatives  of  /  and  possibly  in  /  itself.  The  value  p  =  \  would  correspond  to  /  hav¬ 
ing  square-integrable  first  weak  derivative.  These  matters  are  discussed  further  in  Sec¬ 
tion  5. 

Immediate  quandtative  conclusions  cannot  be  drawn  directly  from  (4.3),  both 
because  the  constants  Crj  „  and  C;  „  are  not  known  and  because  we  lack  information  on 
the  accuracy  of  (4.3)  for  a  fixed  finite  value  of  n.  Somewhat  more  precise  statements 
can  be  made  for  the  linear  orthogonal  series  estimators  studied  in  Sections  6  to  8 
below.  Let  rujin)  and  ruin)  denote  the  minimax  mean  integrated  square  error  when 
considering  only  linear  estimators.  Since  it  is  still  not  possible  to  evaluate  the  con¬ 
stants  in  (4.1)  and  (4.2)  for  //^(n)  and  ruin),  we  intnxiucc  certain  surrogate  error 
measures  in  Section  7  whose  ratio  to  mean  integrated  square  error  always  lies  in  the 
interval  [l-ntc,  l+mc],  where  me 

Our  measures  depend  on  certain  constants  Cj,  C2,  C3,  d^,  d2  and  d^,  the  values  of 
all  of  which  are  given  in  Section  8  below.  The  constants  c,-  depend  on  both  smooth¬ 
ness  p  and  radius  C,  while  the  di  depend  only  on  p.  For  x>  1  define  a  function  aix) 
as  the  solution  to  aloga=x;  then  aix)  can  be  found  numerically  when  required  and  is 
asymptotic  to  x/logx  for  large  x.  Defme  a  sequence  by 

Then  the  minimax  surrogate  risks  rhjin)  and  ruin)  satisfy 
rwin)  =  Cin"^77„(log77„-HC2)  +  G(«'Si) 

=  dxC^^P*'^H\ognlny'^P*'^'>  -(-  0[n-P'^P^^\\ogn)-^'^P*'^^]  (4.5) 

and 

r*u^n)  =  d2C^^P^^'^n-P'^P*^^  +  Oin'^P^^^^P-^^hogn)  .  (4.6) 

The  form  (4.5)  for  is  more  transparent,  but  the  error  term  has  the  same  poly¬ 
nomial  order  as  the  leading  term;  the  error  term  in  (4.4)  is  of  lower  order  and  so  we 
use  (4.4)  in  numerical  computations. 

The  risks  (4.5)  and  (4.6)  naturally  increase  with  the  constant  C,  which  specifies  a 
radius  of  the  smoothness  class  Bp  c-  For  definiteness  we  take  =  2F~^;  this  turns 
out  (see  Section  5)  to  be  the  largest  value  Tor  which  all  /  in  ^  are  guaranteed  to  be 
non-negative  probability  densities,  so  long  as  p^l.  As  C  increases,  increases  faster 
than  ru)  in  relative  terms;  this  is  a  further  consequence  of  the  greater  difficulty  of  the 
indirect  estimation  problem. 


K- 


IT*  '“y  ^  ■“ 
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TABLE  4.1 

Equivalent  direct  sample  sizes  m{n)  to  achieve  the  same  surrogate  minimax  mean 
integrated  square  error  over  smoothness  class  p  as  for  an  indirect  sample  of  size  n. 


fl=10^ 

/j=10* 

ratio  m(l0^)/m(l0p) 

P=1 

1.93x10^ 

1.03x10® 

5.34 

II 

4.85x10^ 

3.12x10®  . 

6.44 

p=5 

1.29x10® 

1.05x10'' 

8.09 

The  explicit  equivalent  sample  size  m(n)  for  given  values  of  n  and  p  follows  by 
ignoring  the  error  terms  in  (4.4)  and  (4.6),  and  then  solving  r^jj^m)  =  rfj(n)  for  m. 
Some  representative  cases  are  given  in  Table  4.1.  As  expected,  the  equivalent  sample 
size  increases  as  the  assumed  amount  of  smoothness  rises.  If  technology  allows  an 
order  of  magnitude  increase  in  the  amount  of  data  collected,  then  the  equivalent  direct 
sample  sizes  increase  by  a  factor  of  between  5  and  8,  this,  factor  itself  increasing  with 
assumed  smoothness. 

In  summary,  our  results  confum  intuition  that  for  the  PET  problem,  the  amount 
of  information  available  is  still  substantial,  but  it  is  by  no  means  as  great  as  if  a  sam¬ 
ple  of  10^  direct  observations  were  available. 

4.2  Comparing  the  doubling  ratios 

We  now  return  to  general  non-linear  estimators  and  use  the  results  of  Section  4. 1 
to  compare  the  doubling  ratios  for  the  direct  and  indirect  estimation  cases.  This  will 
yield  an  alternative  interpretation  of  the  rates  of  convergence  (4.1)  and  (4.2).  The  dou¬ 
bling  ratio  for  the  direct  case  can  be  found  by  solving  the  equation  rp{il)lrp{n)  =  J 
numerically,  if  we  neglect  finite  sample  variation  in  the  "constant"  of  proportionality 
Cd  JcD,n<  which  we  conjecture  approaches  1  as  n—y^,  and  is  in  any  case  bounded 
away  from  zero  and  infinity.  For  the  direct  case,  because  of  the  logarithmic  term,  the 
doubling  ratio  depends  on  n  and  for  p  =  2  and  n  =  10^  is  approximately  equal  to  9. 
Thus  if  the  data  could  be  observed  directly  approximately  9  times  as  many  data  points 
would  be  required  to  halve  the  RMS  error.  For  the  indirect  case  there  is  no  log  term, 
and  so  as  long  as  n  is  large  and  Cj  Jcj  n  is  ignored,  the  doubling  ratio  will  be  approxi¬ 
mately  equal  to  16  for  p  =  2.  The  ratio  /(p)  of  the  doubling  ratio  for  the  indirect  case 
to  that  for  the  direct  case  can  be  taken  as  a  measure  of  the  inefficiency  incurred  due  to 
the  indirect  observation  of  the  data.  For  p  =  2,  the  calculations  given  here  yield 
1(2)  =  16/9  approximately.  As  one  would  expect,  the  inefficiency  /(p)  increases  as 
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the  amount  of  smoothness  assumed  decreases;  sec  the  values  plotted  in  Figure  4.1.  Of 
course,  any  interpretation  of  the  doubling  ratio  must  be  guarded  because  the  calcula¬ 
tions  we  have  performed  take  no  account  of  finite-sample  phenomena  that  disappear  in 


I  3  }  «  « 


Figure  4.1;  Inefficiency  I(p)  for  indirect  estimation  relative  to  direct  estimation  as¬ 
suming  p  square -integrable  derivatives. 

5.  The  Radon  transform  and  smoothness  classes 

The  normalised  Radon  transform  (2.1)  expresses  the  observed  density  g  as  a 
linear  transform  Pf  of  the  unknown  emission  density  /;  In  this  section  we  shall  study 
the  singular  value  decomposition  (SVD)  of  P;  this  decomposition  is  the  key  to  our 
study  in  subsequent  sections  of  the  loss  of  information  about  /  due  to  indirect  observa¬ 
tion. 


I 


To  establish  notation,  we  follow  Davison  (1983)  and  review  briefly  the  singular 
value  decomposition.  Let  H  and  K  be  Hilbert  spaces  and  P  :  H-^K  a  bounded  linear 
operator.  Under  suitable  conditions,  there  exist  orthonormal  sets  of  functions  {97^}  in 
H  and  [yr^}  in  K,  and  positive  real  numbers  [by),  the  singular  values  of  P,  such  that 


(i) 


Pq>y  =  by\ffy. 


SO  that  given  /  in  H, 


Pf  =  ^y{f,<Py)¥vr 
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(ii)  the  { span  the  orthogonal  complement  of  the  kernel  of  P,(ker  P)^; 

(iii)  the  [yfy]  span  the  range  of  P\ 

(iv)  given  g  in  K,  the  /  of  smallest  norm  which  minimises  ||P/  -  g||  is 

f  ='lPv^^S^¥v)<Pv 

V 

Thus  P  is  diagonal  in  the  bases  [(Py]  and  [^fy).  If  a  singular  value  by  is  small, 
then  noise  encountered  in  estimation  of  the  component  of  /  along  <Py  will  be  amplified 
by  a  factor  of  by^ .  Some  form  of  regularization  method  (Tikhonov  and  Arsenin, 
1977)  is  needed  to  deal  with  this  instability,  and  one  such  method,  based  on  tapered 
orthogonal  series,  will  be  introduced  in  Section  7  below. 

In  the  PET  model  described  in  Section  2,  H  is  the  space  )  of  functions  on 

brain  space  which  are  square-integrable  with  respect  to  the  dominating  measure  fi, 
where  B  =  {;t6R^,  ||jt||^l).  Correspondingly,  K  is  the  space  L^(D,X)  of  detector- 
space  functions  square-integrable  relative  to -I,  where  D  =■  {(r,^)  :  0^^^2;r}. 

The  operator  P  :  L^(B,^)  -*  L^(D,X)  is  the  normalised  Radon  transform  of 
(2.1).  As  mentioned  there,  it  can  be  thought  of  as  an  averaging,  or  conditional  expec¬ 
tation  operator.  Suppose  that  X  =  (Xj  ,^2)  is  drawn  at  random  (according  to  /i)  from 
brain  space  fi.  If  a  direction  q>  is  specified  by  =  (cos  tp,  sin  <p),  then 

Pfis^ip)  ^  E[f{X)\u^-X  =  s] 

From  this  representation  it  follows  at  once  that  fi  is  a  bounded  operator  with  norm  1 
and,  by  arguments  involving  characteristic  functions,  that  it  is  one-to-one. 

The  SVD  of  the  Radon  transform  in  this  specific  setting  appears  to  have  been 
first  derived  by  workers  in  optics  and  tomography;  see  references  in  Marr  (1974,  Note 
in  Proof)  and  Deans  (1983,  Section  7.6). 

Since  the  underlying  spaces  are  two  dimensional,  we  need  double  indices,  specifi¬ 
cally  V  e  N  =  {(/,m)  :  m  =  0,1,2,...;  /  =  m,m-2,...,-m).  In  brain  space,  an 
orthonormal  basis  for  L^(fi,/i)  is  given  by 

iPyir,d)  =  (m+l)iZ^(r)e“®  v  =  (/,m)  e  N,  (r,0)  e  fi,  (5.1) 

where  denotes  the  Zernike  polynomial  of  degree  m  and  order  /,  familiar  in  optics; 
sec,  for  example.  Deans  (1983,  Section  7.6),  or  Bom  and  Wolf  (1975,  Chapter  9.2.1 
and  Appendix  VII).  ZemUce  polynomials  arise  naturally  from  a  study  of  the  action  of 
rotation  on  L^(fi,/t),  satisfy  the  orthogonality  relation  Z/+2,(r)Z/+2/(r)r  dr  = 
id/]  +  2s  +  l)“*dir.  and  can  be  expressed  in  terms  of  the  more  general  family  of 
Jacobi  polynomials.  These  and  other  properties  together  with  a  list  of  the  first  few 
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polynomials  arc  given  in  the  above  references. 

In  detector  space  the  corresponding  orthonormal  functions  in  L^(D,A)  are 

\ffy{5,(p)  =  V  =  (l,m)  €  N,  {s,<p)  e  D  (5.2) 

where  U^{cos  6)  =  sin  (/n+ 1)0/ sin  0  are  the  Chebychev  polynomials  of  the  second 
kind.  We  have  Pq)y  =  b^y/y,  with  the  singular  values  by  =  bi^  specified  by 

by  =  (m+l)-i  V  =  {l,m)eN  .  (5.3) 


The  relatively  slow  decay  of  the  singular  values  with  degree  m  (independently  of  /) 
suggests  that  the  costs  of  indirect  observation  in  the  PET  problem  are  not  inordinately 
large. 


Since  we  work  with  real  densities  /,  we  may  identify  the  complex  bases  (5.1)  and 
(5.2)  with  equivalent  real  orthonormal  bases  in  a  standard  fashion.  For  example 
/  =  Uv<Py  where 


V2Re( ?>/,„)  if />0 
'  <PQ.m  if  f=0 
V2Im( ?),  „,)  if /<0 


and  similarly  for  the  real  coefficients  ^  From  now  on,  we  suppress  the  tildes  in  the 
notation  and  use  whichever  basis  is  convenient. 


In  Section  7,  we  carry  out  a  worst-case  analysis  of  various  estimators  in  the  PET 
problem.  This  worst-case  analysis  is  performed  over  a  suitable  class  B  of  emission 
densities  /  in  brain  space.  For  reasons  of  mathematical  tractability,  this  class  is  taken 
to  be  a  particular  ellipsoid  in  the  Hilbert  space  H  =  In  terms  of  the  real 

version  of  the  orthonormal  basis  the  ellipsoid  B  is  specified  by  an  array  of  con¬ 
stants  {Oy}  and  a  threshold  C: 

B=[f=I/y<Py  : 


Our  immediate  goal  is  to  discuss  the  interpretation  of  this  ellipsoid  condition  for 
specific  arrays  { ) .  In  the  simple  case  where  [<Py)  is  the  sequence  of  trigonometric 
polynomials  on  a  bounded  interval  [0,2;r]  in  one  dimension  and 
ay~v~P ,  '^Oyfy  <  «>  if  and  only  if  the  periodic  function  f  has  p  square-integrable 
derivatives  on  the  interval.  Thus  ellipsoid  conditions  can  amount  to  the  imposition  of 
smoothness  and  integrability  requirements. 

To  describe  specific  ellipsoids  in  the  double-index  PET  problem,  it  is  useful  to 
transform  the  index  set  N  by  the  change  of  variables  j  -  (m+l)/2,  k  =  (m-l)l2  into 
the  lattice  orthant  N'  =  {j,k)  :  j  2:  0,k  S  0).  Now  let 
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Bp.c  =  {/:/oo  =  1.  Z  (y+iy’a+iy’|/;t|2  <c2}.  (5.4) 

0',/t)€W'\(o.o) 

It  is  shown  in  the  appendix  that  /  =  f{x\  ,X2)eL^(B,^)  lies  in  Bp  ^  if  and  only  if  / 
has  p  weak  derivatives  that  are  square  integrable  on  B  with  respect  to  the  modified 
dominating  measure  .-^2)  =  {p+'^)(\-X\-X2Ydp{xi  ,X2).  This  condition  is  of 

course  somewhat  weaker  than  requiring  square-integrability  for  p.  It  arises  because  a 
family  of  derivatives  such  as  (_dldx)<p^  is  orthogonal  (after  a  simple  linear  transforma¬ 
tion)  with  respect  to  P2  not  p\ .  When  iterated  p  times,  this  yields  a  condition  on 
Pp^\.  This  same  technical  phenomenon  also  occurs  in  recent  work  of  Cox  (1987).  To 
summarise,  in  spite  of  the  change  in  dominating  measure,  Bp  c  can  be  regarded  as 
imposing  a  set  of  smoothness  and  integrability  conditions  :  the  higher  p  is,  the 
smoother  are  the  functions  allowed  in  c- 

How  smooth  are  the  functions  that  we  are  trying  to  reconstruct?  In  X-ray 
transmission  tomography,  there  may  be  discontinuities,  or  at  least  sharp  jumps,  in  tis¬ 
sue  density  across  the  boundaries  of  various  regions.  As  noted  by  Natterer  (1980, 
1986),  functions  that  are  piecewise  smooth  with  jumps  only  along  smooth  curves  lie  in 
Sobolev  spaces  corresponding  to  p  <  \  square  integrable  (fractional)  derivatives.  In 
emission  tomography,  with  its  inherently  lower  resolution,  it  may  perhaps  be  reason¬ 
able  to  postulate  somewhat  smoother  emission  densities  of  the  labelled  metabolite.  In 
any  case,  our  theory  is  presented  for  arbitrary  values  of  the  smoothness  p  >  0  wher¬ 
ever  possible. 

To  ensure  that  elements  of  ^  ^ire  bona  fide  probability  densities,  some  further 
restrictions  are  needed.  To  have  total  mass  1,  we  require  /oo  =  1-  By  restricting  the 
constant  C  that  governs  the  ellipsoid  size,  we  can  ensure  that  f{x)  ^  0.  This  is  tradi¬ 
tionally  done  using  Sobolev  embedding  theorems,  but  in  the  series  setting,  a  direct 
argument  given  in  the  Appendix  leads  to  the  following  sharp  estimate: 

Ifp^l,  sup  sup  |/(x)  -  1|  ^  (5.5) 

This  bound  is  attained  only  by  linear  functions,  for  example 
fixi,X2)  =  1  +  Thus  if  p  ^  ^0  on  |x|  ^  1  so  long  as 

C  ^  Of  course,  the  more  smoothness  that  is  assumed,  the  larger  the  allow¬ 

able  values  of  C. 

Note  also  that  if  g  =  Pf,  then 

sup  |g(y)-l(  ^  sup  |/(x)-l(  (5.6) 

y  * 


since  is  an  averaging  operator. 
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6.  Some  density  estimators 

In  this  section  we  shall  construct  two  linear  estimators  of  the  density  /,  one  based 
on  direct  observations  from  /  and  one  based  on  "indirect"  observations  from  the  den¬ 
sity  g  in  detector  space.  Of  course,  in  practice  observations  cannot  be  taken  from  / 
itself  and  only  indirect  observations  are  possible.  Both  the  estimators  are  based  on 
orthogonal  expansions  in  series  derived  from  the  singular  value  decomposition  of  the 
Radon  transform  discussed  in  Section  5  above.  It  is  not  our  intendon  that  the  estima¬ 
tors  we  construct  should  actually  be  used  in  practice,  because  they  would  be  inefficient 
computationally  and  would  not  necessarily  yield  non-negative  estimates.  The  non¬ 
linear  methods  that  are  most  useful  in  practice  (see,  for  example,  Vardi,  Shepp  and 
Kaufman,  1985,  and  Jones,  Silverman  and  Wilson,  1988)  are  difficult  to  analyse 
mathematically;  in  any  case  we  shall  show  that  the  weighted  orthogonal  series  estima¬ 
tors  introduced  in  this  section  are,  in  a  certain  sense,  asymptotically  optimal  even 
among  non-linear  methods.  Orthogonal  series  estimators  of  probability  density  func¬ 
tions  have  been  the  subject  of  considerable  attention  in  the-  literature,  particularly  in  the 
univariate  case.  See,  for  example,  Silverman  (1985,  Section  2.7),  Wertz  (1978, 
Chapter  5),  Kronmal  and  Tarter  (1968)  and  Prakasa  Rao  (1983,  Sections  2.2  and  3.3). 

Suppose,  first,  that  we  have  independent  observations  drawn  from  the 

density  /  in  brain  space.  The  basic  idea  of  the  orthogonal  series  method  is  to  expand 
/  as  the  series 

/  =  57v?»v  (6.1) 

y 

and  then  to  obtain  suitable  estimates  of  the  coefficients  fy,  which  are  substituted  into 
(6.1)  to  give  an  estimate  of  /.  The  indices  v  in  (6.1)  are  each  double  subscripts  (/,m) 
as  explained  in  Section  5  above. 

Define  empirical  coefficients  ^y  by 

=  (6.2) 

i=l 

The  4y  depend  on  n,  but  this  dependence  will  not  be  expressed  explicitly.  Since,  for 
each  V,  E^y  =  ^(Pyfdfi  =  fy,  each  ^y  will  be  an  unbiased  estimator  of  the  correspond¬ 
ing  coefficient  fy.  For  reasons  explained  in  Silverman  (1985,  Section  2.7),  in  order  to 
obtain  a  good  estimate  of  the  density  /  it  is  necessary  to  ‘taper’  the  ^y  by  a  suitable 
collection  of  tapering  weights  Wy.  The  direct  estimate  f  of  f  that  we  shall  consider  is 
defined  by  setting 
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We  shall  see  in  Section  7  below  that  the  appropriate  choice  of  weights  will  depend  on 
the  smoothness  conditions  assumed  for  the  density  /. 

Turn  now  to  the  indirect  estimation  of  /  from  a  sample  Yj  from  the  density 
-g  =  Pf  in  detector  space.  It  was  shown  in  Section  5  above  that  the  expansion 
g  =  ^gy^y  of  g  in  terms  of  the  eigenfunctions  y/y  in  detector  space  is  obtained  by 
setting  gy  =  byfy,  where  the  array  (by)  is  as  defined  in  equation  (5.3).  Define  empiri¬ 
cal  coefficients  rty  by 

Vv  =  (6.4) 

i=l 

For  each  v,  rjy  will  be  an  unbiased  estimator  of  gy  and  so  by^rjy  will  be  an  unbiased 
estimator  of  fy.  As  before,  we  introduce  an  array  of  tapering  weights  Uy,  and  define 
an  estimator  f*  of  /  by 

fy  =  Uyby^TJy  ,  SO  that  /*  =  '^Uyby^TJyPy.  (6.5) 

It  will  again  be  the  case  that  the  ideal  choice  of  weights  will  depend  on  the  smooth¬ 
ness  conditions  assumed  for  /,  and  this  will  be  discussed  in  the  next  section. 


7.  Choosing  the  weights 

7.1  Results  for  a  fixed  true  density 

In  this  section  we  shall  investigate  how  the  weights  and  Uy  should  be  chosen 
to  yield  good  estimates  of  the  density  /.  We  shall  also  consider  various  consequences 
of  the  ideal  choice  of  weights.  Our  work  on  the  choice  of  weights  depends  on  some 
simple  sampling  properties  of  the  estimates  /  and  /*,  and  it  is  these  that  we  study  first 
of  all. 

For  any  fixed  density  /  define  constants  ffy  by 

cr^  =  var  y>y(Xi)  =  -  {\(Pyfdtif 

=  \<Pvf<U^  -  fy  (7.1) 

It  was  shown  in  Section  6  that 

E4y  =  fy  (7.2) 


and  it  follows  from  (7.1)  and  the  independence  of  the  X,-  that 
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var  -  n~^  var  <Py{X{)  =  n 


(7.3) 


We  shall  judge  closeness  of  the  estimator  /  to  the  true  density  /  by  the  mean 
integrated  square  error  £  J(/  -  f)^dti.  Though  some  authors  (see  for  example  Dev- 
roye  and  Gyorfi,  1985)  prefer  other  criteria  of  accuracy  of  density  estimation,  mean 
integrated  square  error  is  both  intuitively  attractive  and,  particularly  in  the  present  con¬ 
text,  mathematically  tractable.  Because  of  the  orthonormaJity  of  the  tpy  with  respect  to 

J(/  -  /)>  =  YUv  -  fvf  =  -  fyf.  (7.4) 

V 

Thus,  using  (7.2)  and  (7.3),  we  have 

£](/  -  ffdtl  =  X{^V  var  ^y  +  {EWy^y  -  fyf) 

V 

=  X{«-^w2<t2  +  (l-w,)V.^}  (7.5) 

V 

Some  simple  algebra  from  (7.5)  gives  the  optimal  choice  of  the  weights  Wy  in  the 
sense  of  minimum  mean  square  error.  To  minimise  (7.5),  we  set 

H'v  =  /«? /(/v  +  (7-6) 

and  then  the  mean  integrated  square  error  is  where 

A^direct(/)  =  S  (7.7) 

V 

It  is  important  to  note  that  the  optimal  weights  themselves  depend  on  the  unknown 
density  /,  and  so,  even  if  the  observations  Xj  were  known,  it  would  not  be  pos¬ 

sible  in  general  to  attain  this  optimal  mean  integrated  square  error.  The  quantity 
A/direct(/)  is  the  minimum  possible  mean  integrated  square  error  for  any  weight  array, 
and  should  be  thought  of  as  a  benchmark  against  which  any  actual  weight  sequence 
can  be  compared. 

For  the  indirect  estimator  /*,  some  very  similar  calculations  can  be  performed. 
Define  constants  Ty  by 

=  var  Y^yiVf)  =  -  g^,  (7.8) 

so  that,  as  before, 

var  T^y  =  n~^Ty. 


Making  use  of  (6.5)  we  have 

E\(J*  -  f9dp.  =  E  jjj*y-fyf  =  E  Ziby^Uylly-fy)^ 

V  V 


•  . 


1 

■( 
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=  var  7]y  +  (Eby  ^UyVy-fyf} 

V 

=  'Z{n~^by‘^u^T^  +  (l-Uyff^]  (7.9) 

V 

A  comparison  of  (7.5)  and  (7.9)  shows  that  the  squared  bias  term  has  the  same  form  in 
both  cases,  but  the  variance  terms  are  inflated  by  a  factor  of  by^  in  the  indirect  case. 
To  counteract  this  inflation,  the  optimally  chosen  Uy  decay  faster  than  the  Wy  given  in 
equation  (7.6).  By  very  similar  manipulations  to  those  leading  to  (7.6),  the  optimal 
choice  of  weights  Uy  will  be  given  by 

Uv  -  fvl(Jv  +  n-'^by'^T^)  (7.10) 

and  the  minimum  possible  mean  integrated  square  error  for  the  density  / 

will  be 

A^indirectCT)  =  Z  bZ^} f} j +  n-^b:^})  (7.11) 


7.2  Results  for  densities  constrained  to  lie  in  a  given  class 

The  work  of  Section  7.1  gave  expressions  for  the  best  possible  mean  integrated 
square  error  for  a  particular  density  /.  By  contrast,  we  shall  now  find  densides  that  are 
least  favourable  to  estimate  within  a  pardcular  class,  in  terms  of  order  of  magnitude  of 
mean  integrated  square  error.  This  will  be  a  preliminary  to  the  discussion  of  the  com¬ 
parisons  between  direct  and  indirect  esdmadon.  We  shall  also  find  the  weights  that 
should  be  used  to  minimize,  in  order  of  magnitude,  the  maximum  mean  square  error 
for  densides  in  a  particular  class. 

Consider  a  smoothness  class  of  densides  B  =  ^p,c  defined  in  Secdon  5  above. 
Let  nic  =  and  assume  that  C  is  chosen  sufficiendy  small  that  me  <  1. 

From  (5.5)  it  will  follow  that  inf^  /  S  I -me  and  supg  /  ^  l+mp,  and  these  bounds 
on  /  will  imply  bounds  on  Oy.  For  any  v  +  (0,0),  we  will  have 

^\<pI  fdH^  (\+me)j(Py  dn  =  l+m^  (7.12) 

and 

o}  =  E[<Py{X{)  -  fy)'^  =  dfi 

^  (\-me)\{(Pv-fy9'dH  ^  (l-'«c)j9>v4u 

=  (7.13) 

since  j<Py  dfi  =  0  and  j(Py  =  1. 
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Since  (p^Q  is  constant,  we  have  =  0  for  all  /  and  so  the  weight  wpo  will 
always  be  set  equal  to  1.  The  sums  in  (7.5)  and  (7.7)  can  be  taken  over  all  v  except 
(0,0)  without  changing  their  value.  Let 

mo(/,w)  =  +  (l-w^)Vv  }  (7.14) 

V 

and  define 

Mo(f)  =  rno(/,w)  =  +  n-^).  (7.15) 


The  bounds  (7.12)  and  (7.13)  ensure  that  the  ratio  between  and  Mo(/) 

lies  in  the  interval  [1-mc,  l+mc],  and  so  is  uniformly  bounded  above  and  below  for 
all  n  and  all  /  in  B.  Hence  using  mQ(J,w)  as  a  surrogate  for  mean  integrated  square 
error  will  have  no  effect  in  order  of  magnitude. 

Consider,  now,  the  maximization  of  Mo(/)  over  B.  We  recall  a  lemma  of 
Pinsker  (1980,  Lemma  1).  Subtracting  a  Lagrange  multiplier  term  from 

it  can  be  seen  by  standard  calculus  that  the  maximizer  satisfies 

-  1)+  (7.16) 

and  that  the  maximum  value 

maxe  Moif)  =  n~^X^\-Y^ay)+  (7.17) 


where  Yo  is  chosen  to  ensure  that 


The  weight  array  corresponding  to  the  (Jy)  of  (7.16)  will  be  given  by 

Wv  =  (i-yK)+- 


(7.18) 

(7.19) 


Regarded  as  a  function  of  the  vectors  (fy)  and  (Wy),  mQ(J,yv)  is  convex  over 
(Wy)  for  fixed  {fj)  and  linear,  and  hence  concave,  over  ij^)  for  fixed  (Wj,).  The 
domain  of  both  (w^)  and  (j})  is  the  convex  set  of  all  non-negative  sequences.  It  fol¬ 
lows  by  the  usual  arguments  of  optimisation  theory  (see,  for  example.  Whittle,  1971) 
that  the  ‘maximin’  solution  of  (7.17)  is  also  a  solution  to  the  problem 


min  max 
w 


;;b  "»o(/.w) 

and  that  the  minimizing  weights  are  those  given  by  (*7.19).  Any  density  /  satisfying 
(7.16)  will  be  least  favourable  in  the  class  B.  The  quantity  in  (7.17)  will  be,  in  order 
of  magnitude,  the  minimax  mean  integrated  square  error  for  the  direct  estimation  of  / 


I.  M.  JOHNSTONE  &  B.  W.  SILVERMAN 


POSITRON  EMISSION  TOMOGRAPHY 


over  the  class  B. 

Now  turn  to  the  case  of  indirect  estimation.  The  nature  (2.1)  of  Pfis,<p)  as  an 
average  of  /  over  a  certain  line  segment  in  the  unit  disc  implies  that,  taking  inf  and 
sup  over  B, 

inf  f  <.  vcii  Pf  sup  Pf  ^  sup  / 

and  hence  that  the  bounds  (7.12)  and  (7.13)  also  apply  to  the  As  a  surrogate  for 
the  mean  integrated  square  error  J(/*-/)^dA  we  use 

^i(/.m)  =  'L[n~^b~'^u^  +  (l-Mj^/v^),  (7.20) 


and  again  the  relative  error  of  the  surrogate  is  at  most  me-  Analogous  arguments  to 
those  for  the  direct  case  show  that 


min  max 
u  /eB  u 


max  min 


(7.21) 


with  Yi  chosen  to  ensure  that 


(7.22) 


The  minimum  and  maximum  in  (7.21)  are  attained  by  choosing  as  optimum  weights 

=  (l-yK)+  ;  (7.; 


and  as  least  favourable  density  any  /  for  which 

fy  =  n"'b;^(rf^a;‘-l)+. 


(7.23) 


(7.24) 


To  summarize  this  section,  we  have  found,  within  a  bounded  factor,  the  minimax  mean 
integrated  square  error  over  the  class  B  for  the  estimation  of  /  both  from  direct  and 
indirect  observations.  In  both  cases  the  weights  are  chosen  as  best  possible  for  the 
class  B.  Although  the  expressions  may  appear  to  be  inelegant  and  intractable,  we  use 
them  in  the  next  section  to  obtain  orders  of  magnitude  for  the  minimax  mean 
integrated  square  errors.  Comparing  these  orders  of  magnitude  will  give  an  indication 
of  how  much  information  is  lost  by  using  indirect  rather  than  direct  observations. 
Although  we  have  restricted  attention  in  this  section  to  a  very  special  class  of  linear 
estimators,  we  shall  show  later  on  that  the  minimax  rates  are  not  altered  by  extending 
attention  to  all  estimators  of  /,  whether  linear  or  non-linear. 
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8.  Rates  of  convergence  for  orthogonal  series  estimators 

In  this  section  we  shall  make  use  of  the  results  of  Section  5  above  to  find  the 
orders  of  magnitude  of  the  minimax  mean  integrated  square  errors  for  direct  and 
indirect  observations.  The  calculations  involve  little  more  than  the  approximation  of 
sums  by  integrals  and  some  simple  calculus. 

We  shall  use  the  ij,k)  parametrisation  of  Section  5,  so  that  all  the  expansions  of 

Section  7  became  sums  over  (j,k)  with  j  =  0,1....  and  k  =  0,1 .  The  smoothness 

class  B  will  be  taken  to  be,  for  a  fixed  p  >  0, 

B  =  {/:  /oo  =  1  and  'ZU+ink+iyfji  ^  C^}  (8.1) 

so  that,  following  (5.4), 

aji  =  (j+iyik+\)P.  (8.2) 

As  discussed  in  Section  5,  the  singular  values  are  given  by 

bji^  =  (l+j+k)~^  for  all  j  and  k.  (8.3) 

The  following  integral  approximation  lemma  will  be  used  repeatedly.  It  is 
obtained  heuristically  by  approximating  sums  by  integrals  and  its  proof  is  given  in  the 
appendix.  As  in  Section  5,  the  notation  =  is  used  to  connect  two  quantities  whose 
ratio  is  uniformly  bounded  above  and  below  away  from  zero.  The  quantity  y  in  (8.4)  is 
Euler’s  constant,  0.57722...  . 

Lemma  8.1 

For  any  t],  let  denote  a  sum  over  [U,k):  l^(y+l)(/:+l)Sr/}. 

For  fixed  0,  as  p  tends  to  infinity, 

U+^Yik+^y  =  (r+l)-i77^-^‘{log77+2y-(r+l)-‘)  +  (8.4) 

and 

j:^^)(j+k+l)U+\nk+lY  =  j;r2(r+2)-‘ 77^+2  •  (8-5) 

Consider  first  the  direct  estimation  of  /.  We  will  have  y^a^  ^  1  if  and  only  if 
(y+l)(A:+l)  ^  yo^^P  and  so  the  (  )+  in  (7.17)  and  (7.18)  may  be  replaced  by  (  )  if 
the  sums  over  all  v  are  replaced  by  ^  =  /o Performing  this  replace¬ 

ment,  and  applying  (8.4),  equation  (7.18)  becomes 

=n-^ 

= 1(7)  {roHj+iy^Hk+irf^  -  ij+iYik+if} 
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=  n-^T]P^Hai\ogTi+bi)  +  n-^0{i)P^^),  (8.6) 

where  =p(p+l)"^(p+2)“‘  and  =2yai -{4(p+2)"2-(/7+l)"^).  Let  r;„=77„(C,p) 
be  the  solution  for  t]  of  equation  (8.6);  this  will  be  studied  below.  Apply  similar 
manipulations  to  the  minimax  surrogate  mean  integrated  square  error  (7.17)  to  obtain 

r^(n)  =  maxBAfo(/)  =  E(,)  (l-y^^v) 

=  p(p+2)~‘n"^r;„{logr7„+2y-(p+4)(p+2)“M  +  n"^0(7/i).  (8.7) 

This  yields  equation  (4.4)  with  the  values  for  the  constants  Ci=p(p+2)"^  and 
C2=2y-(/?+4)/(p+2).  It  is  easily  seen  that  77„~(rt/logn)^/^^'^*^  and  that 
maxgMoCf)  ~  (n/log 

We  can  obtain  more  precise  results  by  a  more  careful  analysis  of  equation  (8.6). 
As  in  Section  4,  denote  by  a{x)  the  solution  of  the  equation  aloga=x  The  substitu¬ 
tion  T7  =  exp(-fti/ai)y reduces  equation  (8.6)  to  the  form 

ylogy  =  af ‘C^fl(;>+l)exp{(/>-»-l)Z)i/fl, ).  (8.8) 

It  follows  that 

r/„  =d3(a(C3n))‘/(/'^i)  (8.9) 

with  C3=af 'C^(p+l)exp{(p+l)6i/ai )  and  d3=exp(-i>i/ai).  Substituting  back  into 
(8.7)  gives,  after  some  algebra  using  the  fact  that  a(x)=(x/logx){l+o(l)}  for  large  x, 

rLD<^n)  -  (8.10) 

where  =  (p-(-l)^/^/’^‘){p(p+2)-‘(p+l)"‘ 

It  was  shown  in  Section  7  that  the  true  minimax  mean  integrated  square  error  will 
have  asymptotic  behaviour  of  the  same  rate  as  its  surrogate,  and  hence  it  follows  m 

A 

particular  that,  as  for  the  estimator  /  as  defined  in  Section  6, 

min  max  EUf-f)^dfi  =  max  min  El(J-f)^d^  =  (n/log  (8.11) 

w  /€B  ■'  /€B  w 

Exactly  the  same  steps  can  be  carried  out  for  indirect  estimation.  This  time  set 
fj  =  yf and  write  equation  (7.22)  in  the  form 

Zoi)  U+k+^)[ri^U+^f^Hk+iy^^ -(j+ifik+if) 

=  +  n-^O(fl^^hogfl)-  (8-12) 

where  ^2  =  (^^/3)P(P+2)"*(p+4)“^  Ignore  the  error  term  and  solve  for  ff  to  obtain 
#)f„=(rtC^/fl2)*^^^^^^-  Substituting  back  into  (7.21)  and  performing  some  algebra  gives 
the  required  equation  (4.6)  for  the  minimax  surrogate  mean  integrated  square  error 
r^in),  with  the  constant  ^2  =  V3(p+4))^^^^^^Hp+2)^/^/’'"^\  Hence  it  follows, 
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using  the  equivalence  of  the  surrogate  mean  integrated  square  error  and  the  true  mean 
integrated  square  error,  that 

min  max  E\{f*-f)^dfi  =  max  min  E  \(J*-f)^dn  =  (8.13) 

u  /eB  ^  /eB  u  ^ 

Let  us  now  turn  to  the  question  of  equivalent  sample  sizes  discussed  in  Section  4, 
namely  the  (approximate)  solution  of  the  equation  rl£,(m)=rlj(n)  for  given  n.  A  sim¬ 
ple,  and  very  asymptotic,  calculation  uses  relations  (4.5)  and  (4.6),  with  the  error  terms 
ignored,  to  conclude  that 

m(n)  ~ 

where 

03  =  (p+l)(p+2)-l(di/d2)(^^')/^ 

_  c-2/(p+2)2(p+l)/2pl/(p+2)^^2y/p^p^2)“2-2(p+l)/p(p+2)j3^-2^^^^jj(p+l)/(p+2)_ 

This  apparently  fearsome  constant  decreases  innocuously  from  0.27  to  0.055  as  p 
increases  from  1  to  10  and  is  well  fit  by  the  function  0.006  + 0.533p“*  -0.276p"^ 
over  this  range! 

A  more  accurate  procedure  is  to  solve  r/|o(m)=r£/(n)  using  the  approximation 
(8.7)  for  rljjim)  instead  of  (8.10).  For  values  of  n  such  as  10^  and  10*  considered  in 
Table  4.1,  it  is  best  to  solve  for  m(n)  numerically  using  (8.7)  and  (4.6),  ignoring  the 
error  terms.  Each  evaluation  of  r^(m)  in  turn  requires  that  tj„  is  found  from  (8.9), 
performing  a  numerical  evaluation  of  the  function  a.  Table  4.1  contains  the  results  of 
this  procedure  for  selected  values  of  p. 

To  summarise  this  section,  we  have  shown  that,  for  the  linear  weighted  orthogo¬ 
nal  series  estimators,  the  optimum  rate  of  consistency  in  mean  integrated  square  error 
is  reduced  from  0[{nf log  n)“^A/»+i)}  (q  0(rt~Mp+2)j  jj  ^jj  ^  shown  in  the  next 
section  that  these  rates  of  consistency  are  both  best  possible  even  if  we  allow  the  class 
of  estimators  to  be  extended  to  cover  all  linear  and  non-linear  estimators. 
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9.  Lower  bounds 

In  this  section  we  establish  lower  bounds  on  the  rates  of  convergence  to  the  den¬ 
sity  /  of  estimators  based  on  direct  and  indirect  observations.  These  lower  bounds 
apply  to  arbitrary  (even  non-linear)  estimators,  and  show  that  the  minimax  rates 
obtained  by  the  simple  series  estimators  of  Section  7  cannot  be  improved.  The 
approach  is  based  on  Fano’s  lemma  of  information  theory,  as  developed  by  Ibragimov 
and  Hasminskii  (e.g.  1981)  and  Birg^  (1983),  although  a  slight  extension  of  Birgtf's 
formulation  is  needed  for  the  indirect  observation  case. 

We  write  /(x)  =  /(x,X'*)  for  an  arbitrary  estimator  of  the  unknown  density  f{x) 
based  on  n  direct  observations  X"  =  (A'i,...,Y„)  drawn  i.i.d.  from  f(xj.  Correspond¬ 
ingly,  we  write  /*(x)  =  /*(x,y")  for  an  arbitrary  estimator  of  /  based  on  n  indirect 
observations  Y"  =  drawn  i.i.d.  from  g(x)  =  (P/)(x).  The  notation  ^ 

will  denote  the  ellipsoid  (5.4),  with  C  chosen  so  as  to  ensure  that  all  members  are  pro¬ 
bability  densities. 

Theorem  9.1:  Let  p  ^  1.  There  exist  constants  c  =  c(p,C)  and  c'  =  c'(p,C)  such 
that  in  the  direct  estimation  problem 

inf  sup  £|1/-/||^  ^  c(n/logn)"'’/(/’'^‘\  (9.1) 

and  in  the  indirect  problem 

inf  sup  EWf-fW^  ^  c'n-pf^P^^K  (9.2) 

/  /eBp.c 

Remark:  It  will  be  apparent  from  the  proof  that  explicit  expressions  for  c  and  c' 
could  be  obtained.  However,  the  values  obtained  by  the  present  metf  d  may  well  be 
very  conservative,  and  our  present  goal  is  to  obtain  crude  qualitative  insights.  The 
problem  of  obtaining  exact  constants  in  (9.1)  and  (9.2)  is  open  and  perhaps  of  some 
technical  interest. 

The  convergence  rate  in  the  indirect  problem  clearly  depends  on  the  operator  P~^ 
mapping  the  observable  density  g  to  the  target  density  /.  One  convenient  approach  to 
computing  convergence  rates  has  two  parts:  (i)  compute  a  "modulus  of  continuity" 
T(e)  for  P~^,  and  (ii)  argue  that  a  lower  bound  to  the  minimax  convergence  rate  is 
given  by  (essentially)  T(n“^).  This  approach  separates  stochastics  and  analysis:  step 
(ii)  uses  the  information  theory  lemma  to  bound  the  estimation  error  by  r(n  '^)  while 
step  (i)  is  a  concrete  optimisation  problem  for  the  particular  operator  in  question.  This 
viewpoint  was  taken  recently  by  Donoho  and  Liu  (1987)  in  their  study  of  estimation  of 
linear  functionals. 


'■•I 
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We  begin  with  step  (ii),  which  is  formulated  in  slightly  greater  generality  than 
needed  for  Theorem  9.1.  We  then  return  to  step  (i)  to  compute  the  particular  moduli 
occurring  in  (9.1)  and  (9.2). 

Suppose  there  are  available  n  i.i.d.  observations  =  (Yj  ,...,Y„)  from  a  density 
g{y)dXiy),  yeD,  and  that  we  wish  to  estimate  /  =  Qg,  where 
Q  •  (^.ll*ll«)  is  a  linear  operator  between  the  inner  product  spaces  K 

andH. 

(For  the  PET  application,  in  the  indirect  case  we  take  H  =  and  K  as  the 

linear  subspace  of  L^(D,A)  which  is  the  range  of  the  normalised  Radon  transform  P  of 
Section  2.  Each  space  has  the  inner  product  norm  and  Q  is  taken  to  be  In  the 
direct  case,  H  =  K  =  L^(,B,ji)  and  Q  =  7.) 

Denote  by  /(Y^"^)  an  arbitrary  estimator  of  /  based  on  the  data  Y^"^  from  g. 
Lower  bounds  to  the  estimation  error  ||/(Y'‘)  -  /||//  depend  on  a  generalised  modulus 
of  continuity  of  the  operator  Q  mapping  g  to  /. 

We  suppose  that  geGcAT.  In  our  applications  the  parameter  space  G  will  be  a 
translate  V  +  g°  of  a  set  V  that  is  balanced  about  the  origin  (v6V=^-veV).  Typi¬ 
cally  g°  would  be  the  uniform  density. 

Suppose  that  A7  is  a  finite  dimensional  subspace  of  AT.  A  set 
U  =  Mr\[veK  :  iv||jf^r}  will  be  called  a  finite-dimensional  ball  of  radius  r;  we 
denote  the  radius  of  such  a  ball  by  r(t/).  The  dimension,  d{U),  of  the  ball  is  the 
dimension  of  M.  We  shall  need  a  dimension-normalised  measure  of  radius 
piU)  =  d(U)Mu). 

The  generalised  modulus  of  continuity  of  Q  over  the  parameter  space  V  (and  G) 
may  now  be  defined  as 


Tie)  =  sup 
t/cV 

p{U)=e 


inf 

v€(/:|lv|U=r(i;) 


WQAh, 


(9.3) 


where  the  supremum  is  taken  over  finite  dimensional  balls  of  the  type  described  above. 
Loosely  speaking,  t(£)  measures  the  increase  of  the  singular  values  of  Q  relative  to 
the  parameter  space  V  at  resolution  range  £. 

Notice  that  if  (2  is  a  linear  functional  (so  that  (77, 1 *11//)  =  (R,l‘|)),  the  above 
definition  reduces  to 


t(£)  =  sup  [\Qy\  :  l|v|Ijf  =  e  and  rveV  for 
which  is  the  modulus  of  continuity  studied  by  Donoho  and  Liu  (1987). 
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The  significance  of  the  modulus  functional  is  that  an  (often  sharp)  lower  bound 
for  the  rate  of  convergence  of  -  Qg\\  over  G  is  given  by  For  the 

proof  we  need  an  additional  assumption  bounding  the  Kullback-Leibler  information 
divergence  K{g„,gp)  =  J  \og{gJgfi)  gadX  over  G: 

For  some  5<oo,  K{g^,gp)  ^  B\\ga~gp\\\  for  all  g„,g^eG.  (9.4) 

Proposition  9.2  If  condition  (9.4)  holds,  there  exist  constants  C]  ,C2  such  that 

inf  sup  Eg  \\fn{y^''^)-Qg\\H  ^  CiT2(c2rt"i) 

/.  ifeC 


Proof 

1.  The  estimation  problem  is  at  least  as  hard  as  a  discrimination  problem.  Choose  a 
subset  F°  =  {/i ,...,/^)  c  F  =  (2(G)  that  is  2S~distinguishable  :  namely 

Wfa  ~  fgl  >  2^  if  Pick  a  corresponding  subset  G“  =  {gj . ^Af)cG  with 

ga^Q~^  fa  (fo*"  moment,  we  don’t  assume  that  Q  is  one-to-one).  Given  an  estima¬ 
tor  /,  define  a  discrimination  rule  ?>(y’^"^)  =  — >  F°  that  picks  the  closest 

element  in  F“  to  /(•,y^"^).  Hence,  if  4=  /  e  F°,  then  necessarily 

11/  -  /II  >  8.  Thus  the  maximum  risk  of  /  is  minorised  by  the  average  discrimination 
error  rate 

sug  £^||/-eg|P  S  «52^su^^P/||/-/i><5) 

2  I  P,.  W'Wa)- 

cr=  I 


2.  The  average  error  rate  in  discriminating  amongst  M  hypotheses  (where  here 

n 

Pgidy'')  =  riga(>’i)'^(<^i)  is  the  density  of  n  i.i.d.  observations  from  gg)  can  be 
»=i 

minorised  in  terms  of  the  Kullback-Leibler  discrepancies 


,  dP!t 

KiFi.Pl)  =  /  log  ^  <ip; 

dP p 


This  is  done  by  convexity  methods  (Fano's  lemma  of  information  theory,  see  Ibragi¬ 
mov  and  Hasminskii  (1981,  p.323),  and  Birgd’s  version  (1983,  p.l96))  and  yields 

u^^^KiP^,P^)  +  log  2 


M  SUE 

-jj  Z  Pai(P(Y"Hfg)  ^  1  - 

^  a=l 


log  (A/-1) 


i 


.  -r: 

■i 
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3.  Since  the  observations  are  i.i.d.,  K(Pa,P^)  =  nK(ga,gp).  Using  the  assumption 
(9.4),  the  conclusion  of  1.  and  2.  becomes 


sup  Eg\\f„-Qg\\^  ^ 


seG 


1  - 


«5,^sup  \\ga  - 


^  +  log  2 


log(M-l) 


(9.5) 


where,  by  construction,  Wga-QgpW  =  ll/o-/^ll  ^ 


4.  To  evaluate  the  lower  bound  (9.4),  we  relate  M  and  S,  the  cardinality  and  separa¬ 
tion  of  members  of  F°,  to  the  metric  dimension  properties  of  G.  The  idea  is  to  choose 
sufficiently  many  hypotheses  g^  {M  large)  that  are  sufficiently  close  (1^^  ~  ^^11 
small)  that  the  average  discrimination  error,  and  hence  the  maximum  estimation  risk,  is 
at  least  S^jl. 


From  the  definition  of  the  modulus  r(£),  choose  a  finite  dimensional  ball  C/cV 
so  that  piU)  =  £  and  inf{  1| QvH //:  vet/.H vjlj^  =  r(t/)}  t(£)/2.  Choose 
vi  ,...,vnfeU  such  that  ll  Va-Vy,!]  ^  ^  r(U)f2,  and  define  g^  as  (this  ensures  that 

g„€G).  Thus,  at  the  same  time  l|ga~^^lljc  ^  2r(U)  while,  firom  the  choice  of  U, 


WQga  -  Qgp\  ^ 

so  that  we  may  choose  IS  =  t(£)/4. 


Ilgg~g^ll  t(£) 

r{U)  4  • 


A  useful  lemma  of  approximation  theory  asserts  that  a  k  dimensional  ball  of 
radius  r  contains  an  r/2  distinguishable  subset  of  cardinality  at  least  2*  (e.g.  Lorentz, 
1966,  p.905).  Thus  M  ^  and  collecting  all  the  bounds,  the  right  side  of  (9.5)  is 
bounded  below  by 


f(£) 


.  _  nB4r^(U)  +  log  2 
log  (2‘'<^>-l) 


64  V  dm  J  64  ' 


where  C3  is  an  appropriate  constant  and  e  =  piU)  =  d{U)  ^r{U).  Now  set  £  so  that 
C3n£^  =  i  and  the  proof  of  Proposition  9.2  is  complete. 


Note:  Assumption  (9.4)  will  be  satisfied  if  the  densities  g^  and  gp  are  bounded 
above  and  below: 
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Indeed,  their  Kullback-Leibler  discrepancy  is  bounded  above  by  their  distance: 

=  J  log  dX 

gp 

^  Ci(Si.^2)  /  dX  ^  C2iBi,B2)\\ga-8pV. 

8a 

(since  -log  y  ^  1-y  +  c(y-l)^  fory  >  yo(^)  where yo(c)->0  as  c-^oo). 

The  condition  (9.6)  will  hold  for  the  smoothness  classes  ^  of  Section  5  so 
long  as  p  ^  1  and  C  <  ;see  (5.5)  and  (5.6). 

We  now  return  to  the  PET  setting  to  compute  the  moduli  T{e)  for  Theorem  9.1. 
In  fact,  we  don’t  have  to  solve  the  optimisation  problem  exactly  —  a  reasonable  lower 
bound  to  z{£)  will  suffice. 

Begin  with  the  direct  estimation  problem,  in  which  H  =  K~L^iB,^)  and 
Q  =  I.  Thus  G  =  BpC,8-f>  so  we  revert  to  the  notation  of  previous  sections. 
It  is  convenient  to  restrict  to  balls  of  the  form 

Uj{r)  =  {f:  ,/,  =  0on/‘^). 

ve/ 

(Since  we  always  deal  with  probability  densities,  from  now  on  we  make  the  conven¬ 
tion  that  /oo  =  1  ^tnd  that  J'oJ^  =  iV\(0,0).  )  Hence,  t^(£)  ^  rf(e)  = 

sup{r^  :  C/y(r)cG  and  =e^|/|}  =  £^sup{U|  :  i7y(£|y|i)cG).  To  pack  as 
many  dimensions  as  possible  into  G,  consider  /  of  the  form  -//^  =  { v  :  ^  77 )  = 

{;■  >  Q,k^  0  :  (j+l)(k+l)  ^  77*'^^}.  From  Lemma  8.1, 

|/,|  =  77i/Mog77»/'’{l +  0(1)}.  (9.7) 

Now,  if  S  C^77"^  then  Uj^{r)c:G  and  hence  Cj^{Cij~^)(zG.  This  says  that  the 
radius  of  the  ball  that  can  be  squeezed  into  G  decreases  as  the  number  of  dimensions 
("frequencies")  is  increased.  Thus  Ti(£)  ^  C^tj~^(£)  where  77(£)  is  the  solution  to 
£^\Jjj\  =  Using  (9.7)  we  find  that 

T^(£)  ^  M(£^  log  , 

SO  that  T^(cn"i)  ^  M(Sognlnyf^^^^^ , 

which  establishes  (9.1),  the  first  part  of  Theorem  9.1. 

For  the  indirect  estimation  problem,  we  take  //  =  L^(B,/i),  /iT  =  the  Hilbert 
subspace  of  L^{D,X)  generated  by  the  orthonormal  set  of  singular  functions  {|^^). 
We  take  Q  =  ,  so  that  Q\ffy  =  b~^<Py.  Functions  g&K  have  expansions  'LgvVyy  and 
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I 

I 

I  we  now  consider  balls  C/y(r)  =  {g  :  X  ^  =  0  o"  •  On  such  balls  it  is 

I  veJ 

,  easy  to  compute  a  lower  bound  to  using  the  singular  values  of  Q\ 

\  T^(e)  ^  T?(£)  =  sup{r^  inf  :  Uj{r)<zQ  and  =  e'^\J\ } 

■  ve7 

I 

=  £2{|y|  inf&;2  :  t/K£|yli)cG} 

The  parameter  space  G  =  P(B^.c)  =  (^  :  S  ^  Now  we  seek  not  only 

V 

to  pack  many  dimensions  into  G,  but  also  to  use  subspaces  where  =  j+k+l  is 
large.  Consider  therefore  J  of  the  form 

=  {;•  ^0,k^0  :  0+l)(jl:+l)  ^  tj^/p,  j+k+l  ^  an^fp). 

In  the  appendix  it  is  shown  that 

U,1  =  +  0(1)  as  77-400.  (9.8) 

Now  S  X«w  ^  if  Eg?  ^  C2r7-(P+i)//’. 

*^17  Jv 

Hence  U j^{Ct]~^p^^^I^)c:G  and  thus 

r?(£)  ^  C^7j{£)~^P'^^^Ip  aij^^P  =  C^77“*(e), 

where  this  time  ;;(£)  is  the  solution  to  £^|y^l  =  ,  which  yields 

xl{e)  >  Me^l^P'^^^.  Consequently,  r^(cn“i)  S  Mn~P^^P'*'^\  which  establishes  (9.2) 
and  completes  the  proof  of  Theorem  9.2. 

Remark 

Ibragimov  and  Hasminskii  (1981)  and  Stone  (1982)  have  shown  that  the  minimax 
rate  of  convergence  of  global  mean  integrated  square  error  for  direct  nonparametric 
density  and  regression  problems  is  n~W(27»+<i)^  where  p  is  the  assumed  amount  of 
smoothness  and  d  is  the  dimension,  d=2  in  our  case.  They  consider  classes  of  func¬ 
tions  constrained  by  a  Holder  continuity  condition  of  order  ae(0, 1]  on  the  deriva¬ 
tive,  so  that  p=s+a.  The  extra  logn  term  in  the  rate  of  convergence  (logn/n)^/^^'*"'*^ 

f 

obtained  in  the  present  paper  reflects  the  slightly  reduced  smoothness  imposed  by 
requiring  only  square-integrability  of  the  p*  weak  derivative. 
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10.  Some  remarks  on  sensitivity  to  assumptions 

In  Sections  2  to  5  we  formulated  a  particular  idealised  model  for  the  PET  prob¬ 
lem,  together  with  one  formal  structure  for  comparing  the  minimax  error  of  estimators 
and  the  costs  of  indirect  observation.  Obviously  it  is  important  to  ask  how  the  results 
change  with  modifications  to  the  assumptions  or  error  criteria.  We  discuss  two  rather 
different  such  perturbations  in  this  section.  While  this  is  certainly  not  a  complete  sen¬ 
sitivity  analysis,  the  results  obtained  below  are  broadly  consistent  with  those  of  the 
previous  sections. 

10.1  Alternative  error  measures  taking  account  of  derivatives 

Our  use  of  integrated  squared  error  to  measure  discrepancy  between  estimator  and 
unknown  function  has  shortcomings.  For  example,  being  a  pointwise  measure  pays 
no  attention  to  local  spatial  variations.  Our  results  extend  easily  to  a  class  of  losses 
that  also  measure  the  closeness  of  derivatives  of  the  estimate  to  the  true  unknown 
function;  these  losses  take  more  account  of  the  "shape"  of  the  function  than  does  L^. 

It  is  noted  in  the  Appendix  that  the  norm 

+J  2  |£>;d,'/P4“,.i  (10.1) 

r,5a0 

r+s=q 

is  equivalent  to  the  norm 

1 1  I/I  1 1^  =  Z 

jMO 

Thus  errors  measured  in  norms  equivalent  to  (10.1)  will  be  equivalent  to  loss  functions 
of  the  form  L(f,f)  =  Zv^vC/V'/v)^*  where  c^=c^|=(y+l)’(^-t-l)^. 

To  extend  the  upper  bound  results  of  Section  8  to  this  setting,  note  that  the 
minimax  risk  for  linear  estimates  over  the  ellipsoid  Bp  c  is  given  by 

where  in  the  direct  case  and  tyb~^  in  the  indirect  case,  and  A  is  chosen  to 

satisfy  n~^X~^'^(^^^(ayCyVy-ayVy)=C^.  So  long  as  q<p,  we  obtain  rates  of  conver¬ 
gence  equal  to  and  direct  and  indirect  cases 

respectively. 

It  follows  by  an  extension  of  the  method  of  Section  9  that  these  are  in  fact  the 

exact  rates  of  convergence.  The  2^distinguishable  set  F**  and  the  ball 

Uj{r)=[f:  '^Cyfy^r'^,  /„=0on/‘^}  are  now  constructed  in  the  1 1  M  1 1  norm.  By 
veJ 


defining 
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one  ensvires  that,  for  some  fixed  as (0,1), 
sup  infill-  S 

‘''W  1 1  I/I  1 1" 

With  these  changes,  the  lower  bounds  can  be  derived  from  (9.5)  just  as  in  Section  9. 

10.2  Results  for  the  length-biased  sampling  model 

Here  we  explore  the  effects  of  incorporating  tube-length  sampling  bias  into  the 
idealised  model  of  Section  2,  We  are  not  able  to  obtain  equal  upper  and  lower  rates 
of  convergence,  but  the  ratio  of  our  bounds  grows  more  slowly  than  log  n.  Since  the 
upper  rate  is  the  same  as  for  the  basic  model  of  Section  2,  we  conclude  that  ignoring 
the  tube-length  bias  does  not  have  a  great  effect  on  our  overall  qualitative  conclusions. 


k 

o 

Figure  10.1:  Vertical  cross-section  of  a  detector  ring 

The  tube  length  bias  arises  because  the  direction  of  emissions  is  uniform  in  R^, 
not  just  in  the  plane  of  detectors.  Consider  a  detector  tube  of  length  /  and  height  h 
(see  Figure  10.1),  and  vertical  lines  Li  with  abscissae  Oj  (i=l,2).  An  emission  is 
detected  if  the  two  photons  hit  the  opposite  ends  of  the  tube.  We  assume  that  /i«/ 
and  hence  that  the  emission  density  is  constant  in  the  vertical  direction.  If  an  emission 
occurs  with  abscissa  Oj,  denote  its  probability  of  detection  by  PiD\ai).  A  simple 
symmetry  argument  shows  that  P{D\a{)  docs  not  depend  on  the  value  of  ai  e  [O,/]. 
Indeed,  let  Paj|a,(>’2lyi)  denote  the  conditional  density  of  the  intercept  on  L2  of  an 
emission  at  yj  on  Lj.  This  density  is  symmetric  in  its  arguments: 
Pfl,|«i.(>’2l>'i)  =  Pa^\a,^y\\y2)-  Writing  for  the  collection  of  intercepts  (^1,^2) 

determining  a  line  hitting  both  ends  of  the  tube,  we  see  that 

,  dfv] 

P{D\ai)=  j  Pa,|a.(>'2lyi)  -j;- 
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=  J  Pa,\aM\y2)  ^ 

S"*2 

=  PiD\a2). 

Since  P(Z)|ai)  is  independent  of  and  A  <s;  /,  it  follows  that 

P{D\a{)  =  />(D|0)  =  tan-i  t]  +  tan"!  [-^1  ^  =  ~ 

^  /  l  n  TCI 

K  J  L  J 

Thus  the  probability  of  detection  is  inversely  proportional  to  the  length  of  the  tube, 
namely  for  a  tube  with  co-ordinates  The  dominating  measure 

X(ds,dq>)  =  2}c~^il-s^)^ds  dq)  in  detector  space  should  therefore  be  adjusted  by  a 
factor  proportional  to  (1-^^)“^  yielding  a  new  dominating  measure 
X2ids,d<p)  =  ds  d(()l2K,  which  is  just  Lebesgue  measure  on  detector  space. 

Alternatively,  one  can  explicitly  build  the  three  dimensional  structure  into  the 
derivation  by  assuming  that  there  is  a  detection  cylinder  of  height  h,  not  necessarily 
small.  The  observed  density  of  detected  lines  (which  now  has  four  arguments)  is  a 
weighted  Radon  line  transform  of  the  unknown  emission  density.  In  the  limit  as  h-»0, 
one  recovers  the  planar  version,  incorporating  the  tube  length  correction,  which  has  the 
form,  up  to  a  proportionality  constant, 

g(s,<p)dsd<p  =  4h^gois,q>)  +  dsdq> 

where 

gQ{s,<p)  =  J  fiscosq)  -  fsin?>,ysin?>  -f-  tcos<p)dt 

as  before. 

A  new  difficulty  arises  because  gQ{s,<p)  does  not  in  general  integrate  to  1  against 
dsdfllK.  Indeed  it  may  be  checked  that 

W)  =  goi.s.'P)  ^ 

=  J  2  (10.2) 

m^O 

even 

where  Kim)  =  j^^^il-msin^d)~^d6  is  the  complete  elliptic  integral  of  the  first  kind, 
and  is  the  Fourier  coefficient  of  /  for  the  (radial)  singular  function  ^^(x)  with 
V  =  (0,m).  To  obtain  a  probability  density  in  detector  space  (relative  to  dsdfjln)  we 
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g{s,<p)  =  (P2f)is,<p)  =  pif) 


1 

2{\-sh^pif) 


j  f(s cos q>-t sing),  ssing>+tcosg>)dt 


(10.3) 


We  are  unable  to  duplicate  the  analysis  of  previous  sections  for  lack  of  the  explicit 
singular  value  decomposition  of  P  :  ->  L^(D,dsdg>)  (in  fact,  we  don’t  even 

know  if  P  is  L^-bounded  here!).  Instead  we  adopt  some  ad  hoc  devices. 

To  compute  an  upper  bound  via  orthogonal  series,  we  modify  the  basis  functions 
in  detector  space  into  a  non-orthogonal  set  =  4;r“*(l-i^)Vv.  which  nevertheless 
have  the  property  that  =  P~^(f)jv^yPf  dX  =  p~^byfy.  From  the  empirical 
Fourier  coefficients  Tjy  =  n~^'^\/y(Yi),  and  assuming  for  the  moment  that  p  =  p(J)  is 
known,  we  form  the  natural  estimate 

fix)  =  (10-4) 

V 

where  Uy  are,  as  always,  tapering  weights. 

In  practice,  the  bounded  linear  functional  p(J)  must  be  estimated,  which  we  can 
expect  to  do  at  rate  n~^.  Indeed,  since  boo  =  /qq  =  1,  an  -consistent  estimator  of 
P~^{f)  is  given  by  t)qo  =  =  4;r”^n"^X(l-5?)i.  This  indicates,  but  of 

i 

course  does  not  fully  prove,  that  the  rate  of  convergence  of  the  estimate  of  /,  which 
will  be  slower  than  n~^,  does  not  depend  on  whether  P(/)  is  known  or  not. 

We  thus  compute  the  best  choice  of  weights  («„)  and  find  the  worst  MSE  over 
Bp  c  for  the  resulting  estimator  of  the  form  (10.4).  First  we  note  some  inequalities: 

fr/4  ^  pif)  ,  sup^(/)  =  1  +  (10.5) 

and 

=  \ziyy{Y)  ^  E\\i^y\HY)  ^  (10.6) 

(The  first  inequality  in  (10.5)  follows  from  the  bound  K{m)  ^  nfl  in  (10.2),  and  the 
second  by  maximising  the  second  expression  in  (10.2).).  These  inequalities  allow  us 
to  use  the  surrogate  mean  square  error  (7.20)  in  the  same  fashion  as  in  Sections  7  and 
8.  Hence  we  obtain  the  same  rate  as  an  upper  bound  to  the  minimax  rate  of 

convergence  in  this  modified  setting. 

Even  though  we  don’t  have  the  SVD  of  P  here,  the  lower  bound  approach  of  Sec¬ 
tion  9  yields  some  information.  The  new  feature  emerges  in  bounding  the  term 
sup  II  Pfa  -  PfAl  in  (9.4),  where  ||'||o  is  now  the  norm  of  L^{D ,dsdg>l7jt). 

The  difficulty  is  tliat  yfy  =  by^Pg>y  arc  no  longer  orthogonal  or  of  norm  1.  We 
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employ  the  following  lemmas 

Lemma  10.1 

Let  [<Pi,...,q>n}  be  an  orthonormal  basis  for  a  linear  space  M  and  P  :  M  N  a. 
linear  operator.  Then  there  exists  a  subspace  M'czM  of  dimension  at  least  [m/2]  such 
that 

s^ll^/11  Vll/P  ^  2  ave(llP9>jlP)  ^  2  max(|| P?>,  ||2) 


Lemma  10.2 


WV'lmU  ^  I 


xji  sin^  (m+l)g 
sind 


d0  ~  logm 


as  m— >00, 


Lemma  10.1  is  used  to  isolate  a  subset  of  (in  Section  9)  and  a  corresponding 

subspace  of  dimension  l/^^^|/2  on  which  ||Pp  2  m^{by  1|  V'y Ho).  Now,  Lemma 

10.2  bounds  this  by  l^Xogrj.  Except  for  this  extra  factor  of  log 7;,  the  proof  continues 
as  before  to  yield  the  modified  lower  rate  bound  of  (n/logn)"^/^^'*'^^’ 

Of  course,  at  least  one  of  these  bounds  is  not  sharp,  but  their  ratio  is 
which  is  negligible  relative  to  any  positive  power  of  n. 

11.  Some  concluding  remarks 

This  paper  has  focused  on  lower  and  upper  bounds  for  one  particular  bivariate 
density  estimation  problem  for  indirect  data.  The  same  formalism  applies  to  many 
other  density  and  regression  estimation  pnablems.  The  celebrated  "unfolding"  problem 
for  sphere  size  distributions  is  an  example  involving  univariate  density  estimation  from 
indirect  data  and  the  singular  value  decomposition  of  the  Abel  transform.  For  recent 
results  and  further  references  on  this  problem,  see,  for  example.  Hall  and  Smith 
(1987),  Nychka  and  Cox  (1984),  Jones,  Silverman  and  Wilson  (1988)  and  Wilson 
(1988). 

Noisy  integral  equations  of  the  form  yi=(F/)(t,)+£,-  can  be  treated  using  our 
methods,  at  least  under  appropriate  assumptions  on  the  distributions  of  (?,•,£,).  For 
example,  if  the  observation  points  r,-  follow  a  known  distribution  X{dt)  and  the  errors 
c,  are  independently  Gaussian  (0,<t^),  then  the  information  divergence  between  the 
hypotheses  /i  and  fi  is  K{Pj"\Pj"^)  =  inj[Pfi{t)-Pf2{t))^X{dt),  so  that  the  lower 
bound  methods  of  Section  9  immediatedly  apply.  Upper  bound  results  are  given,  for 
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example,  by  Nychka  and  Cox  (1984). 

For  a  generic  one-dimensional  problem  with  singular  value  decomposition 
Pq>y=by<Py,  by-v~^,  and  with  ellipsoid  determined  by  corresponding  to  "a 

derivatives",  the  exact  minimax  rate  of  convergence  of  the  mean  square  error  in 
^-2a/(2a+2^+i)  should  be  compared  with  the  exact  rate  of  for  the 

corresponding  "direct"  case.  Related  calculations  for  a  large  class  of  one-dimensional 
convolution  equation  models  appear  in  Wahba  and  Wang  (1987). 
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Appendix:  Proofs 

A.l  Smoothness  and  ellipsoids 

In  this  section  we  indicate  the  proof  of  the  statement  immediately  following  (5.4). 
We  employ  Gegenbauer  (ultraspherical)  polynomials  as  normalised  in  Magnus, 
Oberhettinger  and  Sony  (1966,  p.218  ff  )  or  Gradshteyn  and  Ryzhik  (1980,  p.827). 
Define  probability  measures  on  B  by  d^a=a{\^\x\)^y*~^d^{x).  An  orthogonal 
basis  for  is  given  by  the  polynomials 

M*=(cos0,sin0)T.  j) 

Consider  the  conditional  expectation  operator  (P/)(5,0)=E{/(X)|m^^X=5 } 
where  We  view  the  polynomials  as  the  pullback  by  the  adjoint  P*  of  the 

singular  functions  of  PP*.  This  construction  of  the  SVD  of  P  is  explained 

in  Johnstone  (1988),  following  Davison  (1981,  1983)  but  with  different  notation  and 
normalisations. 

In  terms  of  the  basis  (A.l),  the  differentiation  operators  =\{dldxi  -  idldx^)  and 
Dj  =  ^{dl3xi  +idldx2)  have  a  very  simple  action.  Since  (d/dr)C^=2«C“l},  we  have, 
setting  Pji=0  if  jAk=0, 

Dz9jk  =  a^j‘Xk  and  = 

The  raising  of  the  index  from  a  to  a+1  leads  us  from  the  original  measure  n  of  Sec¬ 
tion  2  to  the  family 

It  is  shown  in  Johnstone  (1988)  that  if  /=  2  ^jkVjk  and  r+s=p,  then 

j.kiO 

i^r 

ki.s 

=  'Lcf^afkWfjkWl, 

J^r 

k^s 

where,  for  J^r  and  k^s, 

U+lfik+iyip+l)-^-^^  ^  af,,  S  (y+l)P(k+l)P(p+l)p^. 

It  can  also  be  shown  that  =  (l+m)"V/m»  where  [<Pim\  is  the  orthonormal  basis  as 
defined  in  (5.1).  It  follows,  by  standard  arguments  of  analysis  (see,  for  example,  Gil- 
barg  and  Trudinger,  1977)  that  /  =  Y/jkPjk  has  weak  derivatives  of  order  p  in 
L^{B,Pp+i)  if  and  only  if 

X  {Mnk-^\y\fjk\'^  <oo.  (A.2) 

j.kip 
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If,  in  addition,  /  is  required  to  be  in  L^{B,n),  then  the  sum  in  (A.2)  should  be 
extended  to  the  range  {(y.A:) :  ;^0,A:^0). 

A.2  Proof  of  (5.5) 

It  is  easily  seen  that  the  ellipsoid  B  has  exactly  the  same  description  in  terms  of 
either  the  real  or  the  complex  form  of  the  basis  (5.1),  as  long  as  ali  ^  =  olm‘  In  the 
latter  form,  we  find 

nr,e)  =  I  (m+l)ie"‘»Zi,(r). 

Zemike  polynomials  have  the  properties 

;  sup  IZ^Ml  =Zi,(l)=  I. 

These  may  be  derived  from  the  representation  in  terms  of  Jacobi  polynomials: 
^/+2i('‘)  =  together  with  the  results  of  Szego  (1939,  Section  7.2, 

p.l63),  applied  to  the  polynomials  ’^(2/-!)  as  s  varies.  Hence 

I/-1I  S  £  (m+l)i|/tol 

NMfl.O) 

Changing  the  index  set  from  (l,m)  to  (J,k)  as  in  Section  5,  and  setting 
Xjk  =  we  obtain 

sup  |/-1|  ^  C  sup  X  (;+*+ 1)^(7+ 

where  N'  =  {(y,/:)  :  j  ^  0,  k  ^  0],  and  the  supremum  is  taken  over  arrays  [Xjk)  with 

X  xfk  ^  1.  Thus,  so  long  as  p  ^  1, 

V'\(0,0) 

sup|/-l|^C  sup  (;■+;(+ l)ky+l)~'’'^(/t+l)"^'^ 

Bp,c  A/'\((f.O) 

=  C  sup  (m+l)i(m+l)-^'2 
=  C 

The  supremum  is  clearly  attained  when  m=l,  which  implies  that  the  corresponding 
density  /  is  linear. 

A.3  Proof  of  Lemma  8.1 

We  transform  the  sums  in  (8.4)  and  (8.5)  by  replacing  y+1  by  y  and  k+l  by  k\ 
denote  by  X[f;l  sum  over  the  transformed  range  {(y,^)  :y^l,/:^l  andy^<r;). 
Then  the  sums  (8.4)  and  (8.5)  become 
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and 

respectively.  Using  symmetry  in  (j,k),  we  decompose  the  first  sum  in  (A.3)  as 
S  ^  ifc  ^  2  ic  ^  J  ^  ^  ^  ^  ^  * 


(A.3) 


k=l 


>=1 


;=1*=1 


t 

From  the  relation  =  (r+l)  we  obtain 

;=i 

5  =  2(r+ir^^^k'l[vk-^y'^^+0(n''k-^)]  -  {(r+irHn^Y^^+OiT]^^^))^ 

k=l 

=  2(r+ 1)-J 77^+1  -^xV{(77/:-^)"+*-[77it-‘rM  "  +  Oiv'^^) 

k=l  k=l 

=  2(r+l)-l77"''Milog77+y+0(77-i)}  -  (r+U-^r/^+l  + 
which  yields  the  result  of  (8.4). 

For  the  second  sum  in  (A.3),  we  use  an  integral  approximation,  valid  for  and 

x>l, 

Ifj;*  =  (^+1)-1a:"^'  +  0^\c„\^Cs,  (A.4) 

;=i 

which  follows  from  tl..  bounds 


(j+D-i^r'  ^  t^ds  ^  zf  ^  ^  {s+irHx+ir^Y 

;=1 

To  evaluate  the  sums  employ  (A.4)  as 

follows,  assuming  throughout  that  r;  is  an  integer: 

77  [77*-‘] 

i[„r ‘t'  =  i  *'  £  r' 

)t=i  y=i 

=  Zk^ir+irHvk-^"*^  +  2:^'’c,+i.^.*(r7ri)^+i 

*=1  *=l 

=  (r+2)-‘77^+2|^r2  +  Oiv^^^ik-'^) 

1  1 

=  -7^^(^+2)"'r7'"^^  +  0(77'"^Mog77).  (A.5) 

6 


Since  the  sum  Z[rj]J'^l^'  has  already  been  shown  to  be  O log tj),  substituting 
(A.5)  back  into  (A.3)  completes  the  proof  of  (8.5). 
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A.4  Proof  of  (9.8) 

Throughout  this  proof,  let  ^  and  use  x  denote  the  indicator  function  of 

a  set.  In  terms  of  indicator  functions,  the  required  equation  (9.8)  can  be  rewritten  as 

\Jn\=  X  =  c^Ip  +  0(1).  (A.6) 

j.kil 

To  establish  (A.6),  define  xi^)  =  so  that  x  is  the  smaller 

solution  of  the  equations  xy=^,  x+y=a4+l.  Assume  that  ^  is  sufficiently  large  that  x 
is  real.  It  is  easily  seen  that  x(4)  =  +0i4~^). 

Provided  x  is  real,  the  sum  in  (A.6)  can  be  re-expressed  as 

j=i  k 

1*1 

=  21  {(r^-a)^+0-l)}  +  Oix)  =  c„^  +  0(l). 
y=i 


as  required. 
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